1 Top of the world


1.1 TESTING PROCEDURE

This midterm exam (Rnotebook-midterm) is worth 150 points. For each question, review how much the item is worth in terms of points and plan your time wisely.

I would deem it “unwise” to spend hours on a question that is only worth 5 points.

1.1.1 Static/Existing Resources Are Allowed

This is an open-book examination. You can use your course notebooks (digital and old-school). You can use Internet resources (Stack Overflow, Wikipedia, and YouTube).

1.1.2 Dynamic/Living Resources Are _NOT_ Allowed

You cannot use a living resource on the exam. That includes a classmate, another student, a sibling, a parent, a tutor, or an online forum (where you post a question after the exam period has begun).

If you have questions that need clarifying, please email the instructor and he will try to answer them by email or Zoom. He will be checking his email often during the week of the exam to make himself available to you.

1.1.3 Levels of Mastery

  • Do You Remember?
  • Do You Understand?
  • Can You Apply what you remember/understand to another similar problem?
  • Can You Analyze and Synthesize Data?
  • Can You Evaluate your analyses?
  • Can You Create meaningful visualizations and summaries?

1.1.4 Rubric of Mastery

For every 10 points, this is the general breakdown.

Emerging   Developing   Mastering
0-4        5-7          8-10

2 EXPLORATORY DATA ANALYSIS (EDA)

Exploratory data analysis (EDA) is the process of analyzing data to summarize its main characteristics.

Confirmatory data analysis (CDA) is the process of applying specific statistical methods to analyze the data; the goal is likewise to summarize its main characteristics. We commonly refer to CDA as “statistical hypothesis testing”. CDA generally makes assumptions about how the data is distributed, or seeks to apply a specific model to the data.

So the two approaches have the same objective: summarize the main characteristics of the data. How they achieve that objective is very different.

2.1 Introduction

2.1.1 John Tukey

John Tukey is the father of “Exploratory Data Analysis” (EDA) https://en.wikipedia.org/wiki/John_Tukey. My favorite statistics book is his 1977 book (not surprising) entitled “Exploratory Data Analysis.”

2.1.2 Exploratory vs Confirmatory

From Wikipedia (Accessed October 2020): Tukey “also contributed to statistical practice and articulated the important distinction between exploratory data analysis and confirmatory data analysis, believing that much statistical methodology placed too great an emphasis on the latter.”

I belong to the “Tukey” camp. I believe too much emphasis is placed on “formal statistical methods and tests” and that more emphasis should be placed on the underlying nature of the data. It is from those underlying principles that the statistical methods were developed.

As a data analyst, I believe that first and foremost, we should let the data speak. That is why the first half of the semester started in this form. The second half (confirmatory data analysis) will rely on what is labeled by many as “formal statistical methods”.

2.1.3 Tukey and “Bell Labs”

In 1965, Tukey divided his time between working at Princeton University and working at Bell Labs (a research think tank).

2.1.3.1 Robust Statistics as Nonparametric

Tukey proposed that five summary values are essential to understanding numerical data: the min, the max, the median (technically Q2), and Q1 and Q3 (the quartiles). In R, the function summary has only added the mean to Tukey’s decades-old proposal.
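Tukey’s five numbers are built into base R as fivenum; a quick sketch comparing it with summary (note that summary computes its quartiles via quantiles, so on small samples they can differ slightly from Tukey’s hinges):

```r
# Tukey's five-number summary vs. R's summary(), which adds the mean.
x = 1:9;

fivenum(x);   # 1 3 5 7 9  -> min, lower hinge, median, upper hinge, max
summary(x);   # Min, 1st Qu., Median, Mean, 3rd Qu., Max -> adds the mean
```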

2.1.3.2 Box and Whisker Plot

In 1975, Tukey invented the “box and whisker” plot that identifies the median, inter-quartile range (IQR), and outliers of data. The visualization displays the data without making any assumptions about its statistical distribution. The boxplot is a working demonstration of EDA. Let the data speak!
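The numbers behind the plot can be inspected without drawing anything via boxplot.stats, the base R helper that powers boxplot; a minimal sketch with one planted outlier:

```r
# boxplot.stats() returns the boxplot's five statistics and flags points
# beyond 1.5 * IQR from the hinges -- no distributional assumptions needed.
x = c(1:10, 50);            # 50 sits far outside the bulk of the data
b = boxplot.stats(x);

b$stats;   # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out;     # 50 -> flagged as an outlier
```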

2.1.4 John Chambers and S and R

At the same time as John Tukey, three other men were also working at Bell Labs (John Chambers, Rick Becker, and Allan Wilks) on a statistical programming language, S, that emphasized EDA. This “statistical computing” language was written mostly in Fortran with some C. Chambers published his first “statistical computing” text in 1977, titled “Computational methods for data analysis” https://archive.org/details/computationalmet0000cham/page/n11/mode/2up

Between 1988 and 1991, Chambers updated the engine of S to make it more robust; much of that base code was written by Chambers himself. That same engine still powers much of R today, which uses much of the S codebase under the hood.

R was an open-source offshoot (a “fork”) of S that occurred in the mid-1990s. Today, Chambers is still active in the S (now R) community. My favorite book of his (2008) is titled “Software for data analysis: programming with R”. My second-favorite (1998) is “Programming with data: a guide to the S language”. In 2016, he authored another book that I still need to read, “Extending R.”

Modern R is written in Fortran, C, and C++.

Since its foundation is primarily C, we can use standard “make” and “make install” tools to compile R or its packages from the source code. That is why we needed Rtools on Windows. macOS is Unix-based, so its standard developer toolchain (the Xcode command-line tools) provides what is needed.


2.2 Summary

EDA as exploration is an iterative process.

2.2.1 Analogy: learning a foreign language

I like to use the analogy of learning a foreign language using the “immersion” approach. For example, I studied Spanish in high school, learned vocabulary and grammar, and really could not speak the language well.

I did learn to speak the language well by being dropped into a foreign country for nearly two years. Some key ingredients to learn a language in an “immersive” environment are listed below:

  • surround yourself with others speaking the language to be learned [e.g., I did not spend a lot of time with other Americans speaking English].

  • be present when engaged in the language. Listen intently and try to understand as much as you can, not worrying about what you don’t fully understand.

  • reflect after language engagement. Try to synthesize what “gaps” you have and then develop study habits to fill in those gaps.

  • practice what you have learned.

To some degree, my success was likely accelerated because I had prior training. Regardless, “immersive” practices benefit learning new languages.

2.2.2 Proficiency in Data Analytics

[As part of your journey, I have asked you to keep a “paper-and-pencil” notebook to write down words/phrases/ideas. For example, in this section, there may be words/terms/phrases/ideas you don’t fully understand.]

Proficiency requires an iteration of these key features described above. But first, you have to understand what language you are trying to speak. Is it R? Is it Statistics? Mathematics? What exactly is the language?

In my opinion, the language is the “language of data”.

2.2.3 The “language of data”

How do you think mathematics developed? It likely started with simple data, based on real-world experience: two hands, five fingers on each hand gives me the number ten. Counting in a base-10 system likely resulted. An entire domain of mathematics called “number theory” devotes its studies to these integer values.

How do you think statistics developed? People went out and started measuring things. One person would literally walk down the street in the late 1800s and ask whether he could measure a person’s proportions. Another would study crop yields at different locations and try to ascertain whether they were different.

The foundation of mathematics and statistics is data. So I believe we should let the data speak.

2.2.4 Let the data speak

So I am definitely an EDA-guy. Some people are, some people are not. I personally am a strong believer that we should let the data speak, learn how to describe the data without imposing any restrictions on it, and always think about the data first and foremost.

I also believe that we should use logic, intuition, and insight before we develop any formal “confirmatory” hypothesis testing. I have intentionally architected this course to emphasize EDA.

2.2.5 Quality data provenance

I am also very adamant about data provenance, as I believe the “outputs” of any analysis (whether exploratory or confirmatory) are only as good as the quality of the data. I call this data intimacy. You should care just as much about the process of getting quality data as you do about analyzing said data.

The term GIGO (garbage-in, garbage-out) in my estimation represents what happens when care for quality data is treated lightly.

2.2.6 Iterative Exploration

This full EDA approach is a multi-lens approach. View the data from as many different perspectives as possible before arriving at a conclusion. Base your conclusion on a synthesis of what you analyzed from those different perspectives.

  • We do initial EDA (using mathematical foundations),
  • then we may do confirmatory analyses (traditional statistical methods), and
  • then we synthesize our findings and do a higher-ordered EDA using the original analysis and the confirmatory analysis to make final decisions using sound logic and intuition.
  • this process will enlighten our understanding and possibly help us formulate new suppositions and think about what additional data would inform the topic.

2.3 (10 points) YOUR “EDA” OPINION

[ I have expressed my opinion about the study of data and the importance of EDA in that study. What is your opinion on this topic?

This is worth 10 points, a minimal answer should be at least 3 paragraphs. Agreeing/Disagreeing with my opinion is not how you will be evaluated. How well you express YOUR opinion is what is important.]

The process of exploratory data analysis (EDA) really allows one to get to know the data that they are working with, which can provide more insights than may have been initially expected. The more familiar someone is with the data they are working with, the easier it will be for them to summarize and draw conclusions on that data. EDA is also important for identifying trends that may be present in the data. 

Exploratory data analysis may provide awareness as to which data should be kept and which can be discarded. It can help identify unique characteristics such as important (or not so important) variables and the relationships between each of the variables in the data. This is a crucial step in cleaning and organizing data to tailor it to the needs of the analysis, or research question. EDA is also an important tool in developing models from data.

I think that the idea about “letting the data speak” as pointed out above is really interesting and something that I haven’t thought much about. Being able to explore the data without restricting it allows one to maximize what they learn from the data. This can result in the discovery of relationships and questions to later be explored. 

2.4 SIMULATING DATA

2.4.1 (10 points) Basic Simulation

  • Pick a set.seed choice so the code is replicable. Verify that every time you run the commands, the data does not change with the seed “you chose”.
  • Use the functions rnorm, runif to simulate data.
  • Simulate n=9999; data for each.
  • Call x.rnorm the data for the first and x.runif the data for the second.
  • Plot a histogram graphics::hist and report the summary statistics base::summary of each.
  • Then, plot them using plot(x.rnorm, x.runif);.
  • Finally, plot(x.rnorm, sample(x.rnorm) ); and compare it to plot(x.runif, sample(x.runif) );.

2.4.1.1 Code of simulation

set.seed(99);
n = 9999;

x.rnorm = rnorm(n, mean=0, sd=1);
x.runif = runif(n, min=0, max=1);

hist(x.rnorm);

hist(x.runif);

summary(x.rnorm);
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -3.514122 -0.663744 -0.021075  0.002495  0.683299  4.037498
summary(x.runif);
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000238 0.2551734 0.4975435 0.5006161 0.7535215 0.9998405
plot(x.rnorm, x.runif);

plot(x.rnorm, sample(x.rnorm));

plot(x.runif, sample(x.runif));

2.4.1.2 Describe rnorm, runif

  • Describe what each function rnorm and runif does. How are they similar? How are they different?
  • What does the sample function do?
  • How was plot(x.rnorm, x.runif); different from plot(x.rnorm, ( x.rnorm.sample = sample(x.rnorm) ) ); and plot(x.runif, ( x.runif.sample = sample(x.runif) ) );? How would you describe the shape of each of these plots?
The function `rnorm` takes as parameters the number of observations to be generated, n, and the mean and standard deviation of the distribution to be simulated. This function results in a vector of n normally distributed random numbers. 
The function `runif` also takes as parameters the number of observations to be generated, n, but rather than a mean and standard deviation, this function has the arguments min and max, which are the lower and upper bounds of the distribution. Similar to `rnorm`, `runif` results in a vector of random numbers, but these numbers follow a uniform distribution. 

Both of the above functions produce a vector of n random numbers and are useful for simulating data. They can also both be used for random number generation; it is just important to keep in mind that the generation is pseudo-random (deterministic, given the seed). The most prominent difference between the two is the distribution that the random numbers follow, whether it be uniform or normal.

The `sample` function results in a random sample of elements from the vector that it is passed as an argument. If the size argument of the function is not specified, as seen in the example above, the sample contains the exact same elements as the original vector, but they are likely in a different order. The size of the sample can be either smaller or larger than the original vector, but if it is larger, replacement is required in the sampling. By default, sampling using this function is set to be performed without replacement.
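The permutation behavior described above is easy to confirm: with no size argument, sample returns the same elements reshuffled (a small sketch):

```r
set.seed(7);                 # any seed; just makes the shuffle repeatable
v = 1:10;
s = sample(v);               # default: size = length(v), without replacement

length(s) == length(v);      # TRUE -- same number of elements
all(sort(s) == v);           # TRUE -- exactly the same elements, reordered
```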

The plot generated by `plot(x.rnorm, x.runif)` displays a plot in which random data points from a normal distribution are plotted against random data points from a uniform distribution. When looking at the data points with respect to each axis, it can be observed that the points are relatively evenly distributed between 0 and 1 on the y-axis (representing the uniform distribution), whereas on the x-axis (representing the normal distribution) the points are more densely located near 0, which is the mean of the distribution, and the further from 0 the points get, the more sparse they are. 

The plots generated by `plot(x.rnorm, sample(x.rnorm))` and `plot(x.runif, sample(x.runif))` both represent the same type of distribution (normal vs uniform) plotted against itself. The shape of these plots varies from the plot described above. In the `x.runif` vs. `sample(x.runif)` plot, the points are pretty evenly distributed over the entire plot. Intuitively, this is because two uniform distributions are being plotted against each other, so the data points on both the x- and y-axes are expected to be equally proportioned. Likewise, in the `x.rnorm` vs. `sample(x.rnorm)` plot, the data points on both the x- and y-axes are more densely populated around the mean of 0. This results in a plot with points very concentrated (in a circular fashion) around the coordinates (0, 0), and the further from that point you travel, the fewer data points you are likely to see.

2.4.2 (5 points) “Easter-Egg” Simulation

  • There was an “Easter Egg” that related to setting the seed set.seed using rbinom. If you search in the BlackBoard discussion forum for easter you will see the discussion about August 25-27.
  • In the “Easter Egg”, the goal was to find a scenario using a specific set.seed that would simulate flipping a coin 100 times and getting one result (heads/tails) exactly 52 times.
  • In this problem, the search criteria have changed. Simulate flipping a coin 1000 times and getting one result (heads/tails) exactly 555 times.
  • You need to report 5 values for set.seed that achieves this objective. You can report more.
  • You should explicitly have the code print length(x) where x is a vector of the values that meet the objective.
n = 500000;
seeds = numeric(0)

for (i in 1:n)
{
  set.seed(i);
  result = rbinom(n = 1, 1000, 0.5);
  if (result == 555)
  {
    seeds = append(seeds, i);
  }
}

seeds;
##  [1]  49771  58927  63153  64693  74842 101575 103323 113838 121830 152054
## [11] 168654 178390 211784 225908 253759 298626 303566 316746 317968 325175
## [21] 400407 491595 492805 498906
length(seeds);
## [1] 24
cat('Length of the vector that contains seed values from 1 to', n, 'is:', length(seeds));
## Length of the vector that contains seed values from 1 to 500000 is: 24
cat('The', length(seeds), 'values between 1 and', n, 'for set.seed that achieve getting one result exactly 555 times in 1,000 coin tosses are:', paste(seeds, collapse=', '))
## The 24 values between 1 and 500000 for set.seed that achieve getting one result exactly 555 times in 1,000 coin tosses are: 49771, 58927, 63153, 64693, 74842, 101575, 103323, 113838, 121830, 152054, 168654, 178390, 211784, 225908, 253759, 298626, 303566, 316746, 317968, 325175, 400407, 491595, 492805, 498906

2.4.3 Rolling the Dice

  • You have 3 dice.
  • Each die has the numbers 1:10 … they are ten-sided dice (“decader” dice).
  • Write the necessary for-loops to capture all possible outcomes of rolling the three dice at the same time.
  • A dataframe myrolls should have three columns: dice.1, dice.2, dice.3 plus a fourth column roll.total which is the sum dice.1 + dice.2 + dice.3 of one iteration of the nested for loop.
  • Report the dimensions dim of myrolls.
  • Create a table outcomes.table that summarizes the counts of the myrolls$roll.total
  • Transform the table to a dataframe outcomes.df. Name the columns: c(“roll.total”, “count”);
  • Report the sum of outcome.df$count
  • Create a new column outcomes.df$prob (Probability) that determines the probability of that row given the total sum of the count column.
  • Display the dataframe.

2.4.3.1 Setting Up the Dice Scenario

## your code goes here ...

dice.1 = dice.2 = dice.3 = 1:10;

myrolls = NULL;
for(d1 in dice.1)
  {
  for(d2 in dice.2)
    {
    for(d3 in dice.3)
      {
      roll.total = d1 + d2 + d3;
      row = c(d1, d2, d3, roll.total);
      myrolls = rbind(myrolls, row);
      }
    }
  }

myrolls = as.data.frame(myrolls);
colnames(myrolls) = c("dice.1", "dice.2", "dice.3", "roll.total");

myrolls;
print("Dimensions of myrolls");
## [1] "Dimensions of myrolls"
dim(myrolls);
## [1] 1000    4
outcomes.table = table(myrolls$roll.total);
outcomes.df = as.data.frame(outcomes.table);
  colnames(outcomes.df) = c("roll.total", "count");


total.sum = sum(outcomes.df$count);
print("Sum of outcomes.df$count");
## [1] "Sum of outcomes.df$count"
total.sum;
## [1] 1000
outcomes.df$prob = outcomes.df$count / total.sum;
outcomes.df;
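As an aside, base R’s expand.grid can build the same 1000-row outcome table without nested loops; a vectorized sketch (column names chosen to match myrolls):

```r
# Every ordered outcome of three ten-sided dice, without explicit loops.
rolls = expand.grid(dice.1 = 1:10, dice.2 = 1:10, dice.3 = 1:10);
rolls$roll.total = rolls$dice.1 + rolls$dice.2 + rolls$dice.3;

dim(rolls);                       # 1000 rows, 4 columns -- same as myrolls
sum(rolls$roll.total == 23);      # 36 ways to roll a total of 23
```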

2.4.3.2 Viewing a subset of data to answer a question

  • How many ways can I roll a 23 when I throw the dice at the same time? What is the probability that I roll a 23 on a single throw?
# if you don't have the latest version of humanVerseWSU, you can access the function as follows:
# library(devtools);
# source_url(paste0( path.github, "humanVerseWSU/R/functions-dataframe.R" ));



sub.myrolls = subsetDataFrame(myrolls, "roll.total", "==", 23);

sub.myrolls;
sub.outcomes.df = subsetDataFrame(outcomes.df, "roll.total", "==", 23);

sub.outcomes.df;

2.4.3.3 (10 points) Questions from the Dice simulation

2.4.3.3.1 Roll 23
  • How many ways can I roll a 23 when I throw the dice at the same time? What is the probability that I roll a 23 on a single throw?
When the three dice are thrown at the same time, a 23 can be rolled 36 different ways. The probability that a 23 will be rolled on a single throw is 0.036.
2.4.3.3.2 Roll 12 or 22
  • What is the probability that I roll a 12 or a 22 on a single throw?
sub.outcomes.df = subsetDataFrame(outcomes.df, "roll.total", "==", 12);
sub.outcomes.df = rbind(sub.outcomes.df, subsetDataFrame(outcomes.df, "roll.total", "==", 22))
sub.outcomes.df;
sum(sub.outcomes.df$prob)
## [1] 0.1
The probability of rolling a 12 or a 22 in a single throw is 0.1. 
2.4.3.3.3 Roll 26 or 29
  • What is the probability that I roll a 26 or a 29 on a single throw?
sub.outcomes.df = subsetDataFrame(outcomes.df, "roll.total", "==", 26);
sub.outcomes.df = rbind(sub.outcomes.df, subsetDataFrame(outcomes.df, "roll.total", "==", 29))
sub.outcomes.df;
sum(sub.outcomes.df$prob)
## [1] 0.018
The probability of rolling a 26 or a 29 in a single throw is 0.018.
2.4.3.3.4 Roll 3 once/twice in a row
  • What is the probability that I roll a 3 on a single throw? What is the probability that I roll a 3 twice in a row? First throw = 3 AND second throw = 3?
sub.outcomes.df = subsetDataFrame(outcomes.df, "roll.total", "==", 3);

sub.outcomes.df;
prob.twice.in.row = (sum(sub.outcomes.df$prob))^2;
prob.twice.in.row;
## [1] 0.000001
The probability that a 3 will be rolled on a single throw is 0.001. The probability that a 3 will be rolled twice in a row is 0.001*0.001, or 0.000001.
2.4.3.3.5 Roll 12 or lower
  • What is the probability that I roll at most a 12 on a single throw? That is, a 12 or lower …
sub.outcomes.df = setNames(data.frame(matrix(ncol = 3, nrow = 0)), c("roll.total", "count", "prob"))
  
for (i in 3:12)
{
  sub.outcomes.df = rbind(sub.outcomes.df, subsetDataFrame(outcomes.df, "roll.total", "==", i));
}

sub.outcomes.df;
sum(sub.outcomes.df$prob)
## [1] 0.22
The probability that a single throw will result in at most a 12 is 0.22.
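The same tail probability can be computed directly with a logical mask over all outcomes, skipping the row-binding loop (a standalone sketch that rebuilds the outcome space rather than reusing myrolls):

```r
# P(roll.total <= 12) as the proportion of the 1000 equally likely outcomes.
rolls  = expand.grid(d1 = 1:10, d2 = 1:10, d3 = 1:10);
totals = rolls$d1 + rolls$d2 + rolls$d3;

mean(totals <= 12);   # 0.22 -- matches the loop-based answer above
```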

2.5 INITIAL EXPLORATION OF REAL DATA

We are going to use exploratory techniques to examine some Indeed.com data. If you recall, this job data examines how many jobs reference a certain keyword. Every Monday morning at 12:00:00 AM EST (using a crontab scheduler), this data collection is performed. A few weeks ago, I added some new keywords. The data set we have consists of 5 weeks: 2020-38 to 2020-42.

  • For each “search phrase”, I go to Indeed.com and download the first page of results.
  • From this first page, I grab the “total count”
  • An example is shown in a screenshot, taken this week.
Source: Data provenance history

2.5.1 Import “jobs” data

  • Run code to import the data jobs.
jobs = utils::read.csv( paste0(path.mshaffer, "_data_/indeed-jobs.txt"), header=TRUE, quote="", sep="|");

colnames(jobs) = c("year.week", "search.query", "job.count");
jobs;

  • Create a hist and boxplot and report summary(jobs$job.count).

2.5.2 (5 points) Histogram and Box Plot

hist(jobs$job.count, breaks = 20, main='Histogram of Job Counts')

hist(jobs$job.count, breaks = 300, xlim=c(0,20000), main='Histogram of Job Counts - Zoomed in to show less than 20,000')

boxplot(jobs$job.count, main='Boxplot of Job Counts')

boxplot(jobs$job.count, ylim=c(0, 30000), main='Boxplot of Job Counts - Zoomed in to show less than 30,000')

summary(jobs$job.count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0     149    1734   15170   11847  404527
From the histogram, it can be gathered that this data is not symmetric; rather, it is skewed to the right. A large portion of these data points fall below 20,000. This suggests that the distribution has outliers that are located to the right of the histogram (toward the upper bound of the data). When we zoom in on this histogram to show just the values that are less than 20,000, it is apparent that it is still a right-skewed distribution, with nearly 400 of the total data points falling between 0 and 1,000. Note that there are only 905 total job counts being graphed, so this is close to half of all of the data (approximately 400/905 = 0.4420).

The boxplot further illustrates the skewness of the data, as well as the outliers. From the boxplot, we can also see that this distribution is positively skewed (data is more concentrated toward lower values). When looking at the zoomed-in boxplot, it can be seen that the first quartile and median are both relatively close to the minimum value, and the bottom whisker is much shorter than the top whisker, further supporting that the data is not symmetric. The upper whisker in this boxplot looks to end between about 27,000 and 30,000, which tells us that any value over that number is considered an outlier. The median, which looks to be about 2,000, tells us that 50% of all data points fall below that number, and 50% fall above. This is consistent with the conclusions drawn from the histogram. Visually, the first quartile looks very close to the minimum value, suggesting that 25% of the data points are closely clustered around the minimum, and the third quartile, which looks to fall around 12,000, tells us that only 25% of the data points are above this number and 75% are below.

The summary of the job count column represents numerically the inferences made from the plots. The summary tells us that the minimum value is 0 and the first quartile is 149. To put into perspective how close these numbers are in the context of this data, the range between the minimum and maximum values is 404,527. This means that 25% of the total data falls within only 0.0368% of the total range between values. One piece of data that the summary gives us that the plots didn't necessarily provide is the mean of the data. This number is 15,170, whereas the median is only 1,734. This helps to illustrate the effect of the outliers on the mean of the data set. Because the mean is so much larger than the median, it is likely that the outliers are significantly larger than most of the data. In this case, the mean is even larger than the third quartile. The summary also provides us with the information that the maximum of the whole data set is 404,527. From the boxplot above, we see that this number is obviously an outlier, as the top whisker of our plot ended between 27,000 and 30,000. 

The main inferences that can be made from these three functions is that the data is skewed and there are significant outliers. This is important to keep in mind because these things can have an effect on our data analysis if we are not careful in the way that we work with them.
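The mean-versus-median gap described above is a generic signature of right skew; a toy sketch with simulated right-skewed data (the exponential shape is an illustrative assumption, not the jobs data’s actual distribution):

```r
set.seed(42);
x = rexp(1000, rate = 1/1000);   # right-skewed: many small values, long upper tail

mean(x) > median(x);             # TRUE -- the tail pulls the mean upward
(mean(x) - median(x)) / mean(x); # relative gap, analogous to 15170 vs 1734 above
```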

[What does the histogram tell you about the data? What does the boxplot tell you about the data? What does summary tell you about the data?]


2.5.2.1 Subset some keywords relevant to this course

deep.dive = c("Microsoft Office", "C++", "SQL", "Computer Science", "Python", "Java", "Statistics", "Data analysis", "Data analytics", "Javascript", "machine learning", "Git", "Tableau", "Business intelligence", "PHP", "Mysql", "MariaDB", "SAS", "SPSS", "Stata",  "Data entry", "Big data", "Data science", "Power BI");  # I intentionally do not include "R" because it returns irrelevant results; I should have searched for "R programming" or "R statistics" ... which I am now doing...


or = "";
for(search in deep.dive)
  {
  or = paste0(or, " jobs$search.query == '",search,"' | ");
  }
or = substr(or, 1, nchar(or) - 3);  # base R has nchar(), not strlen(); trims the trailing " | " (3 characters)

## TODO ... update subsetDataFrame to allow "OR" logic, currently only does "AND" ...

# jobs.subset = jobs[ or , ];  # doesn't work ...
jobs.subset = jobs[ jobs$search.query == 'Microsoft Office' |  jobs$search.query == 'C++' |  jobs$search.query == 'SQL' |  jobs$search.query == 'Computer Science' |  jobs$search.query == 'Python' |  jobs$search.query == 'Java' |  jobs$search.query == 'Statistics' |  jobs$search.query == 'Data analysis' |  jobs$search.query == 'Data analytics' |  jobs$search.query == 'Javascript' |  jobs$search.query == 'machine learning' |  jobs$search.query == 'Git' |  jobs$search.query == 'Tableau' |  jobs$search.query == 'Business intelligence' |  jobs$search.query == 'PHP' |  jobs$search.query == 'Mysql' |  jobs$search.query == 'MariaDB' |  jobs$search.query == 'SAS' |  jobs$search.query == 'SPSS' |  jobs$search.query == 'Stata' |  jobs$search.query == 'Data entry' |  jobs$search.query == 'Big data' |  jobs$search.query == 'Data science' |  jobs$search.query == 'Power BI'  , ];

# stem(jobs.subset$job.count);
# subsetDataFrame(jobs.subset, "job.count", "==", 0);
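A base-R alternative that avoids building the long `|` chain entirely is the %in% operator; a self-contained sketch using a toy stand-in for the jobs dataframe:

```r
# %in% tests set membership row-by-row, replacing the chain of == and | tests.
# Toy data (illustrative only -- not the real Indeed.com jobs dataframe):
jobs.toy = data.frame(search.query = c("SQL", "Nursing", "Python", "Welding"),
                      job.count    = c(120, 300, 80, 40),
                      stringsAsFactors = FALSE);
deep.dive.toy = c("SQL", "Python");

jobs.sub = jobs.toy[ jobs.toy$search.query %in% deep.dive.toy, ];
jobs.sub$search.query;   # "SQL" "Python"
```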

2.5.2.2 Histogram and Box Plot of Subset

hist(jobs.subset$job.count, breaks = 20, main='Histogram of Job Counts Subset')

boxplot(jobs.subset$job.count, main='Boxplot of Job Counts Subset')

summary(jobs.subset$job.count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   11553   20746   44306   53896  238111
From the histogram, it looks as though the subset of the data is still skewed right, but not quite to the degree that the whole data set is. It seems that there may be a few 'clusters' of outliers around 160,000 and 230,000. This may be explained by the fact that the data was collected multiple times over the span of 5 weeks, so the clusters may represent the 5 counts for the same keyword over the time period. The mode in this histogram is the bin between 1,000 and 2,000, showing that this is the single range in which most data points fall relative to all other ranges of length 1,000.

The boxplot created above is consistent with the conclusions drawn from the histogram, further supporting these claims. The data is skewed, but not as much as the original data set. When looking at the median, the values below (the first quartile and minimum) are closer together than the values above (the third quartile and maximum), suggesting that data is more concentrated on the lower side of the median. The boxplot contains the same 'clusters' of outliers as the histogram suggested. The upper whisker of the boxplot looks to end around 120,000; therefore, all values above this number are considered outliers of the data.

Although it was not required for this question, I also ran a summary on the subset of job counts to illustrate numerically what is going on. This summary shows that subsetting the original data got rid of some of the larger outliers, as the original data had a maximum of 404,527, and the subset has a maximum of only 238,111. The mean of the subset is still significantly larger than the median of the subset, which can be expected due to the data being skewed right and outliers falling on the upper side of the data, but the difference between the two is not quite as drastic as in the original dataset. The spread of the data in the subset is not as large as in the original data.

[What does the histogram tell you about the data? What does the boxplot tell you about the data?]


2.5.4 Conclusions on logical inference

Distance is a fundamental unit of comparison. We can use our “mathematical” understanding of distance. We can use an “EDA” understanding of the data (e.g., the boxplot). We need to understand the data sourcing and how that will relate to the logical conclusions we are trying to draw.

When we transition to “confirmatory inferential statistics”, we cannot leave our understanding of “maths” and “EDA” behind. They are the foundation from which “inferential statistics” is built. They are “logical inference”.


2.6 COMPUTING DISTANCES

If you recall, we had a notebook on collecting data from Wikipedia. We documented the “data-provenance” protocols to make this happen. We have documented and can replicate our data-collection strategies.


2.6.1 (10 points) Data Provenance defined

Imagine you are preparing for a job interview. Write a 90-second blurb describing “what is data provenance” and “why it matters”. I would suggest the STAR(S) approach mentioned in one of the notebooks. Reference the “Wikipedia” project as an example of how one can implement the features.

Data provenance is the care, or management, of data, from the raw, original data, to whatever present form the data may take on. Not only is data provenance the collection of the data, but it is also the documentation of how that data was organized, cleaned, and prepared in order to answer whatever research questions are posed. One goal of data provenance is to get to a place where you can start asking some questions, which are often prompted by the data itself. This is an appropriate time to map the research data to other data and formulate questions that may involve both sources of data. One importance of data provenance stems from the ability to track and recreate the entire process regarding the data (from collection to analysis). This makes it easier to find potential errors, and also allows others to recreate the analysis being performed. Overall, data provenance provides the who, what, when, where, why, and how of the production of the data in question and verifies the authenticity and reliability of the data for those who may not have been involved in collecting, organizing, and/or analyzing the data.

[What is data provenance?

Probably about 200-250 words (with the 90 second limit) ]

2.6.2 Geospatial distances

“Geo-spatial” studies are becoming much more common in the “data analytics” community, so let’s use basic “latitude/longitude” data to formally talk about “distance.”

So we will look at the 50 state capitals of America (USA). Before that, let’s examine some basic principles of distances using my hometown.


2.6.3 (5 points) Distance from one input to multiple outputs

2.6.3.1 My Hometown “Columbia Falls, Montana” cfalls

  • Find all ZIP codes within 22 miles of Columbia Falls, MT cfalls (use lat/long provide from the Wikipedia lookup)… build the bounding “box” and perform the post-hoc “radial distance” computations (as we did in the homework).
# copy/paste __student_access__/_SECRET_/_SECRET_database_.txt into console...  or this won't work

cfalls.latitude = 48.37028; 
cfalls.longitude = -114.18889;
my.radius = 22; my.units = "mi"; #miles

# THIS is where these exam functions live ...
# source_url( paste0(path.github,"misc/functions-midterm-F2000.R") );  # should be 2020 ... oh well




cfalls.info = getNeighborsFromLatLong(22, 48.37028, -114.18889, "mi");
## [1] "The QUERY returned ...  13  ... NEIGHBORS"
cfalls.info$neighbors;
############## plotting ##############
brown = "#ffe4c4";
green = "#014421";

my.state = "montana";
my.state.color = "#ffe4c4";

my.county = "flathead";
my.county.color = "#014421"; 

my.nearby.states = c("idaho", "washington", "oregon");


plotNeighbors(cfalls.info, 
                    state          = my.state, 
                    state.color    = my.state.color,
                    state.border   = 0.05,
                    county         = my.county, 
                    county.border   = 0.05,  # if you don't see the box, increase this to like 0.75
                    county.color   = my.county.color, 
                    nearby.states  = my.nearby.states); 

In the `buildBoundingBoxFromTheRadiusAndGivenLatitudeLongitude` function, the bounding box is determined by finding delta.latitude and delta.longitude and then adding those values to each direction (north, south, east, and west) of our given latitude and longitude. The values delta.latitude and delta.longitude are found by taking the ratio between the given radius and a set factor (factor.lat or factor.long), which differs whether we are dealing with latitude or longitude. When examining these variables, I noticed that factor.long is larger than factor.lat, thus, because we are dividing the radius by these factors, delta.longitude ends up being a smaller number. (Note: when finding delta.long, we are actually dividing the radius by factor.long multiplied by the cosine of delta.latitude in radians). When we add delta.long and delta.lat to each direction of the given latitude and longitude, the resulting rectangle ends up being skinnier than it is tall because delta.long is a smaller number than delta.lat.

Overall, this visualization does a good job visually portraying the information of our data frame. I appreciate that the visual shows where the bounds for the 'neighbors' are, rather than just showing the neighbors themselves. One change that I would make to this image in particular is to have more contrast between the color of the state and the bounding box. I also think that it would be helpful if, instead of a rectangle, the bound was shown by a circle. This would make more sense to me since we are looking for neighboring cities in a certain radius, implying that the bound is circular. 

[ - why is the box not a square, but a rectangle? … see factor.lat and factor.long in function buildBoundingBoxFromRadiusAndGivenLatitudeLongitude - critique the visualization … what do you like? what would make it better?]


2.6.3.2 Your Hometown of something like it

Instead of cfalls.info, you do hometown.info

  • a location in the continental US of your choosing (not in Montana, Alaska, or Hawaii). [Graphing will not work for Alaska/Hawaii, Alaska has “boroughs” not counties.]
  • find the latitude/longitude of the location you have selected (how and where to look that up?)
  • Initially start with a radius of 13 miles
  • When you run the code, note how many total “neighbors”; if it is less than 20; increase the “miles” so at least 20 results are returned.
  • In the end, you should select a location and radius that works for you. And its visualization also works.
  • Be certain to review and update the parameters before calling these functions.
sandpoint.latitude = 48.266667; 
sandpoint.longitude = -116.566667;
my.radius = 30; my.units = "mi"; #miles


sandpoint.info = getNeighborsFromLatLong(my.radius, sandpoint.latitude, sandpoint.longitude, my.units);
## [1] "The QUERY returned ...  22  ... NEIGHBORS"
sandpoint.info$neighbors;
my.state = "idaho";
my.state.color = "#9fa0a1";

my.county = "bonner";
my.county.color = "#a11818"; 

my.nearby.states = c("oregon", "washington", "montana", "wyoming");


plotNeighbors(sandpoint.info, state=my.state, state.color=my.state.color, state.border=0.05, county=my.county, county.border=0.10, county.color=my.county.color, nearby.states=my.nearby.states);


2.6.4 U.S. State Capitals (cities)

From Wikipedia, we grabbed one page that listed the 50 U.S. cities that are designated the capitals of each individual state in America (United States of America).

Using the power of a for-loop we make our functions work for us. We now have the data ready to go.

capitals = utils::read.csv( paste0(path.mshaffer, "_data_/state-capitals/final/state-capitals.txt"), 
                            skip=1, 
                            header=FALSE, quote="", sep="|");

colnames(capitals) = c("state", "capital", "latitude", "longitude", "capital.since", "area.sq.miles", "population.2019.est", "population.2019.est.MSA", "population.2019.est.CSA", "city.rank.in.state", "url");

# hack-add from https://en.wikipedia.org/wiki/ISO_3166-2:US
# TODO, grab this table "appropriately" as a new function
# Is there a dictionary for shortened city names?
# the long-field names is also an issue that needs to be improved upon in the next iteration.

capitals$st = c("AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY"); # ,"DC","AS","GU","MP","PR","UM","VI");

myLabels = paste0(capitals$capital, ", ", capitals$st);

capitals;

2.6.4.1 Initial Plotting


  • Plot the data on usmap (ggplot2)
latlong = removeAllColumnsBut(capitals,c( "state", "st", "capital", "latitude", "longitude", "population.2019.est") );

# first two elements have to be this
latlong = moveColumnsInDataFrame(latlong, c("longitude","latitude"), "before", "state");

# for transform to work
library(usmap);    
latlong.transform = usmap_transform(latlong);
library(ggplot2);

### plot_usmap ...  

plot_usmap(fill = "#53565A", alpha = 0.25) +
  ggrepel::geom_label_repel(data = latlong.transform,
             aes(x = longitude.1, y = latitude.1, label = capital),
             size = 3, alpha = 0.8,
             label.r = unit(0.5, "lines"), label.size = 0.5,
             segment.color = "#981E32", segment.size = 1,
             seed = 1002) +
  scale_size_continuous(range = c(1, 16),
                        label = scales::comma) +
  labs(title = "U.S. State Capitals",
       subtitle = "Source: Wikipedia (October 2020)") +
  theme(legend.position = "right")


  • Plot the data using maths voronoi (tripack)
colors = rainbow(50, s = 0.6, v = 0.75);

## initial visualization ...
library(tripack);
# plot( voronoi.mosaic(latlong[,4:3], duplicate="remove"), col=colors, xlab="");
plot( voronoi.mosaic(x = latlong$longitude, y = latlong$latitude), col=colors, xlab="");
text(x = latlong$longitude, y = latlong$latitude, labels = latlong$capital, col=colors, cex=0.5);


  • Plot the data on map (base)
### how is any of the other visualizations really any better than a simple map ... with actual locations for Alaska/Hawaii?
library(maps); 
map('state', plot = TRUE, fill = FALSE, 
    col = "blue", myborder = 0.5
    );
points(x = latlong$longitude, y = latlong$latitude, 
                  col = "red", pch = "*", cex = 1);


2.6.4.2 (5 points) Comparing “Visualization Options”

Above, the data was displayed using three different visualization packages.

The first plot_usmap uses the `ggplot2 methodology which is tied to the “tidyverse” landscape of the R community.

The second voronoi.mosaic uses the graph-theory topology known as Voroni partitioning https://en.wikipedia.org/wiki/Voronoi_diagram with the “base” plot function to visualize the topology.

The last one map is from the “base” environment.

Visually, the plot that was most appealing to me is the map created using ggplot2 and plot_usmap because I like how simple the colors are and how it includes the outline for every state (unlike the other two plots). I also think that it is useful that the names of the state capitals are included on this visual. 

I think that functionally, the map that presents the data most effectively depends on what information is being portrayed. For instance, the map from the base environment is useful when presenting where each capital is located with respect to its individual state. One thing that I did not like about this plot though, is that although the asterisks representing the capitals of Alaska and Hawaii are located in the correct place in relation to the contiguous United States, they are just floating in the middle of blank space, rather than being outlined by the shape of the appropriate state. The image created using plot_usmap presents the data most effectively if you are interested in visualizing to the audience not only which states the capitals are located in, but also the name of the capitals. One downfall of this map is that the capitals are not geographically accurate with respect to the state itself. This image also includes Alaska and Hawaii next to California. 

One key factor in determining the appropriateness of including Alaska and Hawaii next to California is the information that is to be portrayed. For example, I think that it is appropriate to include these states next to California in this case, where we are more interested in the capitals of each state rather than the geography of the states themselves. A time when it may not be appropriate to do this is if we are using visualization to represent the distances between the states, because in this case, accurate geography is important to correctly represent what is going on.

[ Visually, which is the most appealing to you? Why?

Functionally, which presents the data most effectively? Why?

When we create visualizations, it is essential to portray the data accurately. For example, there are times when putting Alaska/Hawaii next to California might be appropriate, and other times it might not be.

What is a one key factor that would determine this appropriateness?

]


2.6.4.3 (5 points) Building the distance matrix

The dataframe we are using has been named latlong to represent the latitudes and longitudes of the 50 U.S. cities in America.

# manual conversion
# how many miles is 1 degree of latitude
latitude.factor = 69;  # rough mile estimate  # 68.703 ?
latlong$x.lat = latlong$latitude * latitude.factor;

longitude.factor = 54.6;  # rough mile estimate  
latlong$y.long = latlong$longitude * longitude.factor;

latlong = moveColumnsInDataFrame(latlong, c("y.long","x.lat"), "before", "longitude");

Let’s start with geo-spatial distances. I will do distMeeus and you will do distHaversine


2.6.4.3.1 Meeus

These distance formulas can utilize the true geo-spatial coordinates. The distance table is getting large, so there is a helper function to lookup a certain value.

library(geosphere);
library(measurements);


dist.meeus = conv_unit(  distm( latlong[,3:4], fun=distMeeus),  "m", "mi");  # default meters to miles

dist.meeus.m = as.matrix( dist.meeus );
  rownames(dist.meeus.m) = myLabels;
  colnames(dist.meeus.m) = myLabels;

dist.meeus.df = as.data.frame( round( dist.meeus.m, digits=1) );

dist.meeus.df;  ## too big
lookupPairwiseValue(dist.meeus.df, "Juneau, AK", "Montgomery, AL");
## [1] 2855.6

2.6.4.3.2 Haversine

You can compare the two with the code below (currently in comments).

dist.haversine = conv_unit(  distm( latlong[,3:4], fun=distHaversine),  "m", "mi");  # default meters to miles

dist.haversine.m = as.matrix( dist.haversine );
  rownames(dist.haversine.m) = myLabels;
  colnames(dist.haversine.m) = myLabels;

dist.haversine.df = as.data.frame( round( dist.haversine.m, digits=1) );

dist.haversine.df;  ## too big
lookupPairwiseValue(dist.haversine.df, "Juneau, AK", "Montgomery, AL");
## [1] 2854.6
# comparison code below

x = dist.meeus.df[,2]; # Juneau ... pick one location
y = dist.haversine.df[,2];

my.lim = c(0, 4000 );

plot(x, y,
            xlab="Meeus",
            ylab="Haversine",
            main="dist( Juneau, AK )",
            xlim=my.lim,
            ylim=my.lim);

plotXYwithBoxPlots(x, y,
            xlab="Meeus",
            ylab="Haversine",
            main="dist( Juneau, AK )",
            xlim=my.lim,
            ylim=my.lim);
When examining the above data, it does not seem that there is much difference between these two calculations because the points fall almost perfectly in a line. The boxplots for each of these distances are also symmetric with respect to each other. 

Both of these distance methods find the shortest distance between two points. The conceptual difference between these two calculations is that the Haversine method assumes a spherical earth and ignores ellipsoidal effects, whereas the Meeus method assumes an ellipsoid, and finds the distance between the two points on that ellipsoid.

The results of these two methods are very similar, but a method that provides a more accurate distance algorithm for geo-spatial calculations is distGeo which calculates the shortest path, called the geodesic, between two points on an  ellipsoid.

[ Examining the data above, is there much difference between these two calculations?

What is the conceptual difference between these two calculations? Try ?distMeeus or ?distHaversine

Are the results similar?

Is there a more accurate distance algorithm for geo-spatial calculations? If so, what is it?

]


Let’s now do manhattan and euclidean which are more common in the “statistical clustering” domain. I will do euclidean.


2.6.4.3.3 Manhattan
dist.manhattan = dist( latlong[,1:2],
                      method="manhattan", diag=TRUE, upper=TRUE);
dist.manhattan.m = as.matrix( dist.manhattan );
  rownames(dist.manhattan.m) = myLabels;
  colnames(dist.manhattan.m) = myLabels;

dist.manhattan.df = as.data.frame( round( dist.manhattan.m, digits=1) );

dist.manhattan.df;  ## too big
lookupPairwiseValue(dist.manhattan.df, "Juneau, AK", "Montgomery, AL");
## [1] 4418

2.6.4.3.4 Euclidean

We have converted the true latitude/longitude to a miles-type format, so the resulting table will report miles.

dist.euclidean = dist( latlong[,1:2],
                      method="euclidean", diag=TRUE, upper=TRUE);
dist.euclidean.m = as.matrix( dist.euclidean );
  rownames(dist.euclidean.m) = 
  colnames(dist.euclidean.m) = myLabels;

dist.euclidean.df = as.data.frame( round( dist.euclidean.m, digits=1) );

dist.euclidean.df;  ## too big
lookupPairwiseValue(dist.euclidean.df, "Juneau, AK", "Montgomery, AL");
## [1] 3179.8

You can compare the two with the code below (currently in comments).

x = dist.manhattan.df[,2]; # Juneau ... pick one location
y = dist.euclidean.df[,2];

my.lim = c(0, 4000 );

plot(x, y,
            xlab="Manhattan",
            ylab="Euclidean",
            main="dist( Juneau, AK )",
            xlim=my.lim,
            ylim=my.lim);

plotXYwithBoxPlots(x, y,
            xlab="Manhattan",
            ylab="Euclidean",
            main="dist( Juneau, AK )",
            xlim=my.lim,
            ylim=my.lim);
The main difference between these two calculations is that Euclidean distance is calculated as the shortest straight line distance between two points ("as the crow flies"), where as Manhattan distance used the assumption that you are constrained to city blocks. When using Manhattan distance, you are not able to walk along the diagonals because there would be buildings in the way. Instead, you must stick to the "grid" of the city streets. 

Although the results for the two calculations seem to follow a similar pattern, I would not necessarily say that the calculations themselves are similar because they do not lie in a completely straight line when plotted against each other. It is clear that they are not as similar as the distances compared above. 

[ What is the difference between these two calculations?

Are the results similar?

]


2.7 HIERARCHICAL clustering as a function of distance

Aggregating or agglomerating data is typically called “clustering” and is generally considered to be a “unsupervised learning” method.

2.7.1 Introduction

We use hclust to perform Hierarchical Clustering. The word “hierarchical” is used because the nature of the data is organized like a family genealogy. The bottom of the tree represents the descendants that eventually “link” to a common ancestor.

If you type ?hclust you can review the parameter options which we explored in a notebook. hclust is primarily a function of dist() [distance], and we have just computed some distances, so we could try and apply hclust to our data to see if our state capitals will cluster into meaningful regions.

There are several agglomeration methods (“linkage”) one could choose from. Each method takes the dist https://en.wikipedia.org/wiki/Hierarchical_clustering#Metric and performs some pairwise distance algorithm called a linkage criteria https://en.wikipedia.org/wiki/Hierarchical_clustering#Linkage_criteria.

I generally use either the “complete” linkage method or the “ward.D2” linkage method. You can read help ?hclust to better appreciate why: “A number of different clustering methods are provided. Ward’s minimum variance method aims at finding compact, spherical clusters. The complete linkage method finds similar clusters.”

If you are trying to link binary data (zeroes and ones) or genomic data (where the distances were computed using a genetic-distance algorithm), you may want to try the UPGMA approach: “average” or “centroid”.

2.7.2 Analogy of Family

Since I am from a large family, I will make a family tree analogy. For me, I am the 5th of 11 siblings (5 brothers, 5 sisters). If I wanted to cluster the siblings in a pair-wise fashion, how would I begin?

First, I would ask, which other sibling is most like me? Since this is a pair-wise approach, and there are 11 siblings, maybe at the initial stage, I will not be paired with another sibling.

In fact, at least one will not be paired because the total number is 11. And maybe more than one will not be paired in the initial stage. Remember “similarity” is being defined based on some distance-linkage method. And to use this approach, “similarity” needs to numerical data.

At each stage, pair-wise joining occurs until there is nothing left to join. The tree contains all of the elements. Every element (branch) eventually joins the main branch (trunk) of the tree.

2.7.3 Clustering U.S. capital cities based on latitude, longitude

We already have some data for the U.S. capitals and have computed a Euclidean distance using a capital-city’s longitude and latitude. Let’s choose to cut the result into 12 agglomerations. Why 12? It was a choice based on my life experience and intuition. As I reflected on why, I did some external searching that validates a choice in that range https://en.wikipedia.org/wiki/List_of_regions_of_the_United_States. You certainly could run this analysis with another choice It is exploratory, and your intuition matters.

Geographically, I am saying the capital-city does represent the state. An ideal representation may be in the center (centroid) of the state. Remember this when we see the linkages with “New York”, the city is Albany, not Manhattan.

## ward.D2
hclust.ward2.dist.euclidean = hclust(dist.euclidean, method="ward.D2");

plot( hclust.ward2.dist.euclidean, 
      labels= myLabels );
rect.hclust( hclust.ward2.dist.euclidean, k=12 );

## complete
# 
# hclust.complete.dist.euclidean = hclust(dist.euclidean, method="complete");
# 
# plot( hclust.ward2.dist.euclidean, 
#       labels= myLabels );
# rect.hclust( hclust.ward2.dist.euclidean, k=12 );
#   

2.7.4 Understanding the cutree

In the above example, the tree was cut into 12 groups based on the “distance” formula used and based on the “agglomeration” linkage technique invoice. I likely should have used a “geo-spatial” distance, but we will see that mere “euclidean” distance seems to perform okay.

A fancy word for this statistical tree is a “dendrogram”. I call each element of the tree a branch. The smallest branches (twigs) are the fundamental elements, in this case the cities. Over time, they merge with other small branches, and so on.

The relative height when this smallest branch merges with another branch demonstrates when the branch has found a similar pair-wise match (with another smallest branch or a merging branch). I call “Honolulu Hawaii” an isolate because the vertical height when it merges with another branch is the highest of all of the smallest branches (e.g., cities). “Juneau Alaska” is another isolate, but it does merge before “Hawaii”.

It would be nice if we could decompose this information and look at one cutree at a time. And color-code the distinctions. Using the function plot.hclust.sub we can do this.

source_url( paste0(path.github,"humanVerseWSU/R/functions-EDA.R") );  # EDA functions ...
## SHA-1 hash of file is dd72e464241952b23d5b08d2213099eccf7a86f6
hclust.ward2.dist.euclidean$labels = myLabels;
plot.hclust.sub(hclust.ward2.dist.euclidean, k=12);
## Registered S3 method overwritten by 'dendextend':
##   method       from   
##   text.pvclust pvclust

## [1] "Pruning 1 of 12"
## [1] "Pruning 2 of 12"
## [1] "Pruning 3 of 12"
## [1] "Pruning 4 of 12"
## [1] "Pruning 5 of 12"
## [1] "Pruning 6 of 12"
## [1] "Pruning 7 of 12"
## [1] "Pruning 8 of 12"
## [1] "Pruning 9 of 12"
## [1] "Pruning 10 of 12"
## [1] "Pruning 11 of 12"
## [1] "Pruning 12 of 12"

# plot.hclust.sub(hclust.complete.dist.euclidean, k=12);

2.7.5 (10 points) Review one clustering tree (dendrogram)

Choose either hclust.ward2.dist.euclidean or hclust.complete.dist.euclidean and review how the U.S. state capitals are clustered. [I commented out one form, so you will have to re-run if you want to select that one.]

Comment on the “face validity” of this approach based on your understanding about how the U.S. regions are defined? Are the North/South Dakotas together? What about the North/South Carolinas? What about the Pacific Northwest? While living in Kentucky, some people called the area “Kentuckiana” meaning Kentucky/Indiana. Does that show up? Also note anything that seems peculiar.

When looking at the hclust.ward2.dist.euclidean, in general, the state capitals are clustered with nearby states, but not necessarily by region. Because Alaska and Hawaii do not have any neighboring states, these two states are each a part of their own cluster. The cluster in the western part of the United States generally have less states in their clusters than the states in the Midwest and eastern part of the country. 

Based on how I understand the U.S. regions to be defined, this approach did not cluster very accurately by region. Many of the regions are missing states, or have other states added to them. The North and South Dakotas did cluster together, even at the lowest level. The North and South Carolinas did end up in the same cluster, but at the lowest level, North Carolina is paired with Virginia and South Carolina merges in later. This clustering tree split up the Pacific Northwest, with Washington and Oregon being clustered with California and Nevada, and Idaho being clustered with Montana and Utah. "Kentuckiana" does show up in this clustering method, as Kentucky and Indiana merge with each other first, before they merge with any other states. Something that seems peculiar to me is that North Dakota, South Dakota, Wyoming, and Colorado are in the same cluster, because there is quite a bit of distance between Colorado and North Dakota. It seems that Colorado would be better in a cluster with one of its other neighboring states. 

2.7.6 Additional remarks about hclust

I find hclust to be a nice initial perusal of the data.

If you want to understand the stability of a particular hclust to use it for something other than an “initial perusal of the data,” I would recommend pvclust which I introduced in the weekly notebooks. It is often used in peer-reviewed research, as it provides a p-value of sorts regarding the stability of the hclust structure.


2.8 GENERIC clustering

Aggregating or agglomerating data is typically called “clustering”. In exploration, we can create some generic clustering techniques. Below we will create two adhoc clustering rules.


2.8.1 Arbitrary Aggregation

Recall the “movie dataset” with “Will Smith” and “Denzel Washington”. We have a collection of movies, and how much money each movie made at the box office. We could organize those movies by some arbitrary rules. For example:

  • Cluster 1: NA. We have missing data regarding the money. So let’s put all movies that are NA into that cluster.
  • Cluster 2: Under a million dollars
  • Cluster 3: 1-4.99… million dollars (greater than or equal to one, but less than 5)
  • Cluster 4: 5-49.99…
  • Cluster 5: 50+

2.8.1.1 (5 points) Movie Aggregation [Arbitrary] for Will and Denzel

library(devtools);
source_url( paste0(path.mshaffer, "will") );
## SHA-1 hash of file is 3c1c97d961577e9b47a12faa20dca74bfe755c87
source_url( paste0(path.mshaffer, "denzel") );
## SHA-1 hash of file is 6692d8c5629db1ae023691bb55de2664a48cdde0
movies.50 = rbind(will$movies.50, denzel$movies.50);

unique(movies.50$ttid); # are they in any shared movies ???
##   [1] "tt0480249" "tt1386697" "tt0116629" "tt0119654" "tt0343818" "tt0454921"
##   [7] "tt0448157" "tt0120912" "tt1409024" "tt0386588" "tt0814314" "tt0112442"
##  [13] "tt0172156" "tt0120660" "tt6139732" "tt0145660" "tt2381941" "tt1815862"
##  [19] "tt1596350" "tt1229340" "tt5519340" "tt0307453" "tt1155076" "tt0120891"
##  [25] "tt0120783" "tt1502397" "tt0248667" "tt4682786" "tt3322364" "tt1025100"
##  [31] "tt0114558" "tt0300051" "tt0284490" "tt0146984" "tt1837709" "tt0338466"
##  [37] "tt0947802" "tt1823664" "tt0268397" "tt1082886" "tt5814534" "tt3721964"
##  [43] "tt0416212" "tt0108149" "tt0328099" "tt0167427" "tt0466839" "tt0107478"
##  [49] "tt7255568" "tt0466856" "tt0765429" "tt0139654" "tt0454848" "tt0455944"
##  [55] "tt0328107" "tt1907668" "tt0453467" "tt1037705" "tt0107818" "tt1599348"
##  [61] "tt0210945" "tt1272878" "tt1111422" "tt0477080" "tt2404435" "tt0145681"
##  [67] "tt3766354" "tt0251160" "tt0097441" "tt0368008" "tt0112740" "tt2671706"
##  [73] "tt0174856" "tt0104797" "tt0107798" "tt0119099" "tt0133952" "tt0313443"
##  [79] "tt0427309" "tt0115956" "tt0107616" "tt0124718" "tt0168786" "tt6000478"
##  [85] "tt0114857" "tt0112857" "tt0102789" "tt0092804" "tt0100168" "tt0117372"
##  [91] "tt0088146" "tt0097880" "tt0102456" "tt0099750" "tt0091786" "tt0082138"
##  [97] "tt0097373" "tt4283892" "tt1698652" "tt0118783"
loadInflationData();

movies.50 = standardizeDollarsInDataFrame(movies.50, 
                2000, 
                "millions", 
                "year", 
                "millionsAdj2000");

movies.50$cluster.arbitrary = NA;
str(movies.50);
## 'data.frame':    100 obs. of  13 variables:
##  $ rank             : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ title            : chr  "I Am Legend" "Suicide Squad" "Independence Day" "Men in Black" ...
##  $ ttid             : chr  "tt0480249" "tt1386697" "tt0116629" "tt0119654" ...
##  $ year             : num  2007 2016 1996 1997 2004 ...
##  $ rated            : chr  "PG-13" "PG-13" "PG-13" "PG-13" ...
##  $ minutes          : num  101 123 145 98 115 117 92 88 106 118 ...
##  $ genre            : chr  "Action, Adventure, Drama" "Action, Adventure, Fantasy" "Action, Adventure, Sci-Fi" "Action, Adventure, Comedy" ...
##  $ ratings          : num  7.2 6 7 7.3 7.1 8 6.4 6.2 6.8 6.6 ...
##  $ metacritic       : num  65 40 59 71 59 64 49 49 58 58 ...
##  $ votes            : num  675193 588111 520657 507618 491489 ...
##  $ millions         : num  256 325 306 251 145 ...
##  $ millionsAdj2000  : num  213 233 336 269 132 ...
##  $ cluster.arbitrary: logi  NA NA NA NA NA NA ...
## you do something here ... 

# (1) populate cluster.arbitrary

for (i in 1:100)
{
  if (is.na(movies.50$millionsAdj2000[i]))
  {
    movies.50$cluster.arbitrary[i] = "Cluster 1"
  }
  else if ((movies.50$millionsAdj2000[i]) < 1)
  {
    movies.50$cluster.arbitrary[i] = "Cluster 2"
  }
  else if (movies.50$millionsAdj2000[i] >= 1 & movies.50$millionsAdj2000[i] < 5)
  {
    movies.50$cluster.arbitrary[i] = "Cluster 3"
  }
  else if (movies.50$millionsAdj2000[i] >= 5 & movies.50$millionsAdj2000[i] < 50)
  {
    movies.50$cluster.arbitrary[i] = "Cluster 4"
  }
  else if (movies.50$millionsAdj2000[i] >= 50 )
  {
    movies.50$cluster.arbitrary[i] = "Cluster 5"
  }
}

movies.50$cluster.arbitrary;
##   [1] "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5"
##   [7] "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5"
##  [13] "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 4" "Cluster 4"
##  [19] "Cluster 4" "Cluster 5" "Cluster 1" "Cluster 5" "Cluster 5" "Cluster 5"
##  [25] "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 4" "Cluster 4" "Cluster 4"
##  [31] "Cluster 4" "Cluster 4" "Cluster 4" "Cluster 4" "Cluster 2" "Cluster 4"
##  [37] "Cluster 4" "Cluster 5" "Cluster 5" "Cluster 3" "Cluster 1" "Cluster 3"
##  [43] "Cluster 4" "Cluster 4" "Cluster 4" "Cluster 4" "Cluster 1" "Cluster 5"
##  [49] "Cluster 2" "Cluster 4" "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5"
##  [55] "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5"
##  [61] "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5"
##  [67] "Cluster 5" "Cluster 5" "Cluster 4" "Cluster 5" "Cluster 5" "Cluster 4"
##  [73] "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 4" "Cluster 4" "Cluster 4"
##  [79] "Cluster 4" "Cluster 5" "Cluster 4" "Cluster 4" "Cluster 4" "Cluster 4"
##  [85] "Cluster 4" "Cluster 4" "Cluster 4" "Cluster 4" "Cluster 4" "Cluster 5"
##  [91] "Cluster 4" "Cluster 4" "Cluster 4" "Cluster 4" "Cluster 4" "Cluster 4"
##  [97] "Cluster 2" "Cluster 2" "Cluster 1" "Cluster 1"
# (2) summarize how many movies live in each (table count)

table(movies.50$cluster.arbitrary);
## 
## Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 
##         5         4         2        38        51

2.8.2 Aggregating using Quantiles

John Tukey emphasized that ordering the data and then splitting it based on the ordering was a fundamental premise of exploratory data analysis.

2.8.2.1 Tukey’s Summary Data

John Tukey proposed five elements as primary data for analysis.

  • the minimum min
  • the maximum max

Sorting the data makes it easiest to find these data, and will be useful to find the other three exploratory summary features.

This is what I would call “slice and dice”. The data is cut in half, and the value of that middle “cutting point” is the Q2 which we call the median.

Next, the lower half could also be cut in half, and the value of that middle “cutting” point is Q1.

Then, the upper half could also be cut in half, and the value of that middle “cutting” point is Q3.

A common metric derived from this “median-split” procedure is called the interquartile range IQR which is defined as the distance between Q3 and Q1. It literally represents the middle 50% of the data; 50% of the elements of the dataset are between Q3 and Q1.


2.8.2.2 Quartile Example

x = 1:99;
length(x);
## [1] 99
median(x);
## [1] 50
x[50];
## [1] 50
median(x[1:49]);
## [1] 25
x[25];
## [1] 25
median(x[51:99]);
## [1] 75
x[75];
## [1] 75
# probability-approach ... 
# note: the default algorithm is type=7; type=1 (inverse empirical CDF)
# reproduces our "slice and dice" values exactly
stats::quantile(x, prob = c(0.25, 0.5, 0.75), type=1);
## 25% 50% 75% 
##  25  50  75

The various quantile algorithms address edge cases, such as how to interpolate between values and whether or not to include the dividing element in each half, but the principle holds.

We can generalize this idea by not always cutting the data in half (median-split as n=2, median-median-split as n=4). Instead, we could cut the data into tenths (we call them deciles) or into hundredths (we call them centiles, i.e., percentiles). The function quantile performs this operation, and if you dig into the doStatsSummary function used in this course, you can see its application.
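For example, the nine interior decile cut points of a simple vector (a sketch on toy data):

```r
x = 1:100;
# nine interior cut points split the sorted data into ten equal-sized groups
deciles = stats::quantile(x, prob = seq(0.1, 0.9, by = 0.1), type = 1);
deciles;   # 10% -> 10, 20% -> 20, ..., 90% -> 90
```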


2.8.2.3 (5 points) Movie Aggregation [Decile] for Will and Denzel

movies.50$cluster.deciles = NA;
str(movies.50);
## 'data.frame':    100 obs. of  14 variables:
##  $ rank             : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ title            : chr  "I Am Legend" "Suicide Squad" "Independence Day" "Men in Black" ...
##  $ ttid             : chr  "tt0480249" "tt1386697" "tt0116629" "tt0119654" ...
##  $ year             : num  2007 2016 1996 1997 2004 ...
##  $ rated            : chr  "PG-13" "PG-13" "PG-13" "PG-13" ...
##  $ minutes          : num  101 123 145 98 115 117 92 88 106 118 ...
##  $ genre            : chr  "Action, Adventure, Drama" "Action, Adventure, Fantasy" "Action, Adventure, Sci-Fi" "Action, Adventure, Comedy" ...
##  $ ratings          : num  7.2 6 7 7.3 7.1 8 6.4 6.2 6.8 6.6 ...
##  $ metacritic       : num  65 40 59 71 59 64 49 49 58 58 ...
##  $ votes            : num  675193 588111 520657 507618 491489 ...
##  $ millions         : num  256 325 306 251 145 ...
##  $ millionsAdj2000  : num  213 233 336 269 132 ...
##  $ cluster.arbitrary: chr  "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5" ...
##  $ cluster.deciles  : logi  NA NA NA NA NA NA ...
## you do something here ... 

# use # stats::quantile(x, prob=seq(0.1,0.9,by=0.1), type=1 );

# (1) how many NA's are there ... keep them NA's

sum(is.na(movies.50$millionsAdj2000))
## [1] 5
# (2) for the rest of the data, break it up into deciles

millionsAdj2000.withoutNAs = na.omit(movies.50$millionsAdj2000)
stats::quantile(millionsAdj2000.withoutNAs, prob=seq(0.1,0.9,by=0.1), type=1 );
##        10%        20%        30%        40%        50%        60%        70% 
##   7.638768  20.177743  27.511542  37.259081  53.554796  64.786272  74.509802 
##        80%        90% 
## 115.650000 146.638920
deciles.vec = quantile(millionsAdj2000.withoutNAs, prob=seq(0.1,0.9,by=0.1), type=1 );

# (3) $cluster.deciles for a given movie should be NA, 1, 2, 3, ... 10

for (i in 1:100)
{
  if (is.na(movies.50$millionsAdj2000[i]))
  {
    movies.50$cluster.deciles[i] = "NA"  # the string "NA" (not a true NA) so table() will count the missing movies
  }
  else if ((movies.50$millionsAdj2000[i]) < deciles.vec[1])
  {
    movies.50$cluster.deciles[i] = 1
  }
  else if ((movies.50$millionsAdj2000[i]) >= deciles.vec[1] & (movies.50$millionsAdj2000[i]) < deciles.vec[2])
  {
    movies.50$cluster.deciles[i] = 2
  }
  else if ((movies.50$millionsAdj2000[i]) >= deciles.vec[2] & (movies.50$millionsAdj2000[i]) < deciles.vec[3])
  {
    movies.50$cluster.deciles[i] = 3
  }
  else if ((movies.50$millionsAdj2000[i]) >= deciles.vec[3] & (movies.50$millionsAdj2000[i]) < deciles.vec[4])
  {
    movies.50$cluster.deciles[i] = 4
  }
  else if ((movies.50$millionsAdj2000[i]) >= deciles.vec[4] & (movies.50$millionsAdj2000[i]) < deciles.vec[5])
  {
    movies.50$cluster.deciles[i] = 5
  }
  else if ((movies.50$millionsAdj2000[i]) >= deciles.vec[5] & (movies.50$millionsAdj2000[i]) < deciles.vec[6])
  {
    movies.50$cluster.deciles[i] = 6
  }
  else if ((movies.50$millionsAdj2000[i]) >= deciles.vec[6] & (movies.50$millionsAdj2000[i]) < deciles.vec[7])
  {
    movies.50$cluster.deciles[i] = 7
  }
  else if ((movies.50$millionsAdj2000[i]) >= deciles.vec[7] & (movies.50$millionsAdj2000[i]) < deciles.vec[8])
  {
    movies.50$cluster.deciles[i] = 8
  }
  else if ((movies.50$millionsAdj2000[i]) >= deciles.vec[8] & (movies.50$millionsAdj2000[i]) < deciles.vec[9])
  {
    movies.50$cluster.deciles[i] = 9
  }
  else if ((movies.50$millionsAdj2000[i]) >= deciles.vec[9])
  {
    movies.50$cluster.deciles[i] = 10
  }
}

movies.50$cluster.deciles;
##   [1] "10" "10" "10" "10" "9"  "9"  "10" "10" "9"  "10" "6"  "7"  "9"  "9"  "10"
##  [16] "10" "5"  "5"  "5"  "8"  "NA" "10" "9"  "9"  "7"  "9"  "6"  "3"  "3"  "2" 
##  [31] "2"  "3"  "4"  "4"  "1"  "4"  "4"  "6"  "8"  "1"  "NA" "1"  "4"  "2"  "4" 
##  [46] "4"  "NA" "6"  "1"  "2"  "8"  "8"  "8"  "7"  "7"  "7"  "6"  "8"  "8"  "8" 
##  [61] "9"  "6"  "5"  "6"  "7"  "7"  "7"  "7"  "5"  "6"  "8"  "5"  "5"  "6"  "9" 
##  [76] "3"  "5"  "5"  "3"  "7"  "3"  "3"  "3"  "2"  "3"  "2"  "4"  "2"  "3"  "5" 
##  [91] "4"  "1"  "2"  "1"  "1"  "2"  "1"  "1"  "NA" "NA"
# (4) summarize how many movies live in each (table count)

table(movies.50$cluster.deciles);
## 
##  1 10  2  3  4  5  6  7  8  9 NA 
##  9 10  9 10  9 10  9 10  9 10  5

2.9 CENTROID clustering (k-means) as a function of distance

2.9.1 Introduction

Rather than clustering on distance-linkage in a pair-wise fashion, we can cluster by randomly selecting just k points in our data and then identifying their nearest neighbors using some distance approach.

For example, if k=3, we would randomly select three of our data points. We would then compute the distances from all of the remaining points to these 3 anchor points. The points that are closest to a given anchor are assigned to that anchor. At the end of this stage, we have new groups, so a new centroid is determined for each. At this point, the centroid https://en.wikipedia.org/wiki/Centroid is likely not one of our data points, but a location within the given cluster. In the “naive” approach, you merely take the average (mean) of all the members of your cluster. More advanced approaches (the default “Hartigan-Wong” of kmeans) utilize deviations from the average, called a sum-of-squares approach. This is why in our kmeans-notebook we analyzed the wss to ascertain how many clusters k we should use.

Regardless, after the new centroids (centers) for the clusters are determined, the process iterates. All distances are computed from all points to the new centroids; points are assigned to the nearest centroid's cluster (in this example: 1, 2, 3); a new centroid center is computed; and we repeat the process until a stopping rule is reached: maybe we have exhausted the number of iterations allowed (the iter.max parameter of kmeans), maybe no data points are changing membership, or maybe we have met some objective (like wss).

It is possible to get a kmeans result by merely starting with 3 different random points. In general, kmeans is fast (and our computers are so much faster than the computers of 1990: the first computer I built in 1995 had 32MB of RAM, a Pentium processor https://en.wikipedia.org/wiki/Pentium#Pentium and a 900MB Cheetah hard-drive).

Anyway, we can utilize the parameter nstart to try many different starting values, and let the program identify the best, most consistent solution.

2.9.2 My recommendations

I would recommend the default algorithm Hartigan-Wong with iter.max=100 and nstart=100. You can compare the timings against the default values: iter.max=10 and nstart=1. Since the data is likely multidimensional, stars are the best way to summarize the results and membership of kmeans. Please see the “kmeans-notebook” for examples.
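A minimal, self-contained illustration of those settings on toy 2-D data (hypothetical points, not the climate data):

```r
set.seed(1);
# two well-separated groups of 25 points each
pts = rbind( matrix(stats::rnorm(50, mean = 0), ncol = 2),
             matrix(stats::rnorm(50, mean = 5), ncol = 2) );
# recommended settings: many iterations, many random restarts
km = stats::kmeans(pts, centers = 2, iter.max = 100, nstart = 100);
table(km$cluster);   # each group of 25 is recovered cleanly
```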


2.9.3 WIKIPEDIA CLIMATE DATA

For the 50 U.S. cities, we harvested climate data. In the “Wikipedia” notebook, details of that data provenance were outlined. If we want to compare the cities using the climate data, we have to build a matching dataframe, which means we have to select features that exist in each city’s climate data set.

There are four temperature features that are always present:

  • Record high F (C) … we will call it “high.max”
  • Average high F (C) … we will call it “high.avg”
  • Average low F (C) … we will call it “low.avg”
  • Record low F (C) … we will call it “low.max” [probably not the best name choice, but it is parallel in form to the first element]

Additionally, for precipitation we have:

  • Average precipitation inches (mm) … we will call it “rain”
  • Average snowfall inches (cm) … we will call it “snow”

For each of these features we have 12 months of data. This is a nice dataset. Let’s see what we can do with it.

2.9.3.1 Basic Background Research

We should begin by doing a bit of peripheral research on the topic, to gain “domain knowledge” so we know what to do with the data. A few resources that would be beneficial:

https://www.forbes.com/sites/brianbrettschneider/2018/07/08/when-does-the-hottest-day-of-the-year-usually-occur/#78141f47548c

https://www.climate.gov/news-features/featured-images/whats-coldest-day-year


2.9.3.2 One Graph

When graphing data to visualize, it is essential that you keep the scales uniform so quick visual comparisons are accurate. I gave you a task to practice creating that “one informative” research graphic. I now present my version for you to use and critique against your own efforts. I am a plot guy, so some of you may have a different `ggplot2`-type solution. Ultimately, the intent of this graphic is to best summarize the data in a meaningful way for exploratory analysis.


2.9.3.2.1 (5 points) One Research Graph
climate = utils::read.csv( paste0(path.mshaffer, "_data_/state-capitals/final/state-capitals-climatedata.txt"), header=TRUE, quote="", sep="|");


plotTemperatureFromWikipediaData(climate, city.key="capital", city.val="Helena");

plotTemperatureFromWikipediaData(climate, city.key="capital", city.val="Baton Rouge");


I think that this graphic does a good job of displaying the climate information in a visual manner. I like that it is color coded based on how warm or cold a temperature is and I think that it is nice how it displays so many different variables on one plot. 

One thing that I might change about this plot is to make the lines and the points less chunky. I think that overall, this would help de-clutter the graphic. I also think that it might be helpful to add a separate bar to show snowfall, rather than having it on the same chart as the rainfall. Lastly, if there is a way to make the shading between the record and average temperatures more transparent, it might be helpful to be able to see the grid lines throughout the graphic. 

Overall, this graphic is aesthetically pleasing. The only thing that I think would make it better would be to make the lines and points thinner, as mentioned above. This might help with the overall organization of the graph, as it would seem less cluttered. I think that this graph is highly functional and does an excellent job portraying relevant information. Looking at this graphic and drawing conclusions would be much easier than looking through the values of a data frame or table. 

[ What do you like about this graphic?

What do you dislike?

Is it aesthetically pleasing? Is it functional?

]


We can achieve a side-by-side comparison using the function described below. The first city will be graphed on the left, the second city on the right.

compareTwoCitiesClimates(climate, city.key="capital", city.1="Helena", city.2="Baton Rouge");


Similar to what I wrote above, I like how this graphic is able to display information based on multiple different factors. The way the information is displayed, it is easy to draw conclusions about this data. This graphic in particular makes it easy to compare the climate between two different locations.

One thing that I don't like about this graphic in particular is the way that a different number of grid lines is shown on the y-axis of each graph. Although the y-axes are the same scale between the two graphs, it does not look that way at first glance. I think that adding an equivalent number of grid lines to each graph, even if the points on the graph do not span them all, would solve this problem. The visible grid lines that appear on each graph are the same; there are just certain ones that are unique to each graph (120 on Baton Rouge and -20, -40 on Helena). I also think that it would be beneficial to make the shading transparent so the grid lines are able to be seen throughout the graphic.

The difference between rain and snow on this graphic is that rain is displayed as a point each month, whereas snow is displayed as a line graph that is shaded underneath. I don't necessarily think that this is a bad approach, but I think that if the snow and rain had their own, separate areas in the graphic it might be easier to read. When there is both snow and rain in a region, it seems a little hard to determine what the actual snowfall is.

[ What do you like about this graphic?

What do you dislike?

Are the y-axis the same scale? Are the visible gridlines for each the same?

What is the difference between rain and snow on the graphic? Was that a good approach? How would you have done it?

]


2.9.3.2.2 (5 points) One Publication Graph

We are in exploration, so the “one” research graphic may be very different than the “one” formal graphic designed for a client. Typically, the final graphic needs to meet certain criteria:

  • Very pleasing aesthetically
  • Interactive if possible
  • Live data feeds if possible
  • Served in a secure, safe, private location (if required by the client)

To demonstrate the differences, I created a mockup of our 50 U.S. cities and put it into a nice finalized product form called “highcharts”.

http://md5.mshaffer.com/WSU_STATS419/_EXAMPLES_/fiddle_usmap/

It pulls data in real time (using AJAX) to grab the weather at the latitudes/longitudes we defined in our Wikipedia notebook.

  • You can use your mouse to draw a box to zoom in.
  • The scroll wheel on the mouse also helps you zoom in/out.
  • Hold down CTRL with your left hand and use your mouse button to drag the map. It seems to only “pan” in the x-direction at the moment.
  • The data in the popup displayed can be customized. I report the temperature in Celsius/Fahrenheit and also display the population data we gathered from Wikipedia.

Once you have a template built, it is rather easy to modify it. Here I changed the background map, and all of the data/features stay the same:

http://md5.mshaffer.com/WSU_STATS419/_EXAMPLES_/fiddle_usmap/world.html


2.9.4 Which Features to Include in the Analysis

  • months, you can pick all of them 1:12 or maybe just one month per season (April, July, October, January)
  • X, depending on what you chose for months, you can now select what climate columns you want to use
    • Some of Temperature Data
    • All of Temperature Data
    • Precipitation Data
    • Everything (All of Temperature Data, Precipitation Data)

If we want to cluster cities, which decisions seem best? Why? As you can see from the code below, you just comment/uncomment the options and can quickly rerun the analysis.

  • WHICH MONTHS
  • WHICH COLUMNS

2.9.4.1 WHICH MONTHS & WHICH COLUMNS

climate = utils::read.csv( paste0(path.mshaffer, "_data_/state-capitals/final/state-capitals-climatedata.txt"), header=TRUE, quote="", sep="|");

##################### WHICH MONTHS #####################
########################################################
months = 1:12; # all the data
#months = c(1,4,7,10); # one month of each of the four seasons
########################################################


month.abb;  # ?month.abb
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
month.name;  
##  [1] "January"   "February"  "March"     "April"     "May"       "June"     
##  [7] "July"      "August"    "September" "October"   "November"  "December"
month.name[months];  # these are the names of the months you are selecting ...
##  [1] "January"   "February"  "March"     "April"     "May"       "June"     
##  [7] "July"      "August"    "September" "October"   "November"  "December"
# this function would allow us to use different months as criteria and different climate-data keys.  It is variadic and flexible.  `key.n` are the names we will use for our new columns ...

climate.df = buildClimateDataFrame(climate, months, keys=c("Record high F (C)", "Average high F (C)", "Average low F (C)", "Record low F (C)", "Average precipitation inches (mm)", "Average snowfall inches (cm)"), keys.n = c("highmax", "highavg",  "lowavg", "lowmin", "rain", "snow") );

climate.df;
names(climate.df); # this helps you see the indexes ...
##  [1] "state"       "capital"     "st"          "labels"      "highmax.Jan"
##  [6] "highmax.Feb" "highmax.Mar" "highmax.Apr" "highmax.May" "highmax.Jun"
## [11] "highmax.Jul" "highmax.Aug" "highmax.Sep" "highmax.Oct" "highmax.Nov"
## [16] "highmax.Dec" "highavg.Jan" "highavg.Feb" "highavg.Mar" "highavg.Apr"
## [21] "highavg.May" "highavg.Jun" "highavg.Jul" "highavg.Aug" "highavg.Sep"
## [26] "highavg.Oct" "highavg.Nov" "highavg.Dec" "lowavg.Jan"  "lowavg.Feb" 
## [31] "lowavg.Mar"  "lowavg.Apr"  "lowavg.May"  "lowavg.Jun"  "lowavg.Jul" 
## [36] "lowavg.Aug"  "lowavg.Sep"  "lowavg.Oct"  "lowavg.Nov"  "lowavg.Dec" 
## [41] "lowmin.Jan"  "lowmin.Feb"  "lowmin.Mar"  "lowmin.Apr"  "lowmin.May" 
## [46] "lowmin.Jun"  "lowmin.Jul"  "lowmin.Aug"  "lowmin.Sep"  "lowmin.Oct" 
## [51] "lowmin.Nov"  "lowmin.Dec"  "rain.Jan"    "rain.Feb"    "rain.Mar"   
## [56] "rain.Apr"    "rain.May"    "rain.Jun"    "rain.Jul"    "rain.Aug"   
## [61] "rain.Sep"    "rain.Oct"    "rain.Nov"    "rain.Dec"    "snow.Jan"   
## [66] "snow.Feb"    "snow.Mar"    "snow.Apr"    "snow.May"    "snow.Jun"   
## [71] "snow.Jul"    "snow.Aug"    "snow.Sep"    "snow.Oct"    "snow.Nov"   
## [76] "snow.Dec"
##################### WHICH COLUMNS #####################
########################################################
#X = climate.df[,5:52];  # temperature (assumes months = 1:12 above)
#X = climate.df[,5:76];  # everything, includes rain (assumes months = 1:12 above)
X = climate.df[,5:20];  # temperature (assumes months = c(1,4,7,10) above)
#X = climate.df[,5:28];  # everything, includes rain (assumes months = c(1,4,7,10) above)
########################################################


  rownames(X) = climate.df$labels;
Xs = scale(X);

2.9.5 To scale or not to scale, that is the question

X = climate.df[,5:20];  # everything ... you have to change months above to get this dataframe to be the correct size ... months = 1:12
  rownames(X) = climate.df$labels;
Xs = scale(X);

So let’s do some analysis with all of the data available to us. Most of the data is in Temperature, with ranges from -42 degrees Fahrenheit (Helena, Montana) to 122 (Phoenix, Arizona).

The precipitation data (rain and snow) is measured in inches. So should we scale the data? The answer for PCA and orthogonal projections is absolutely YES, but for hclust and kmeans, is that always the case?

You can make a choice below, and observe how it influences your answers.
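A toy illustration of why the choice matters: a feature recorded on large numbers can dominate Euclidean distance unless the columns are scaled (hypothetical values, not the climate data):

```r
X.toy = data.frame(tempF = c(30, 90), rain_mm = c(200, 1400));  # hypothetical cities
dist(X.toy);          # ~1201.5 : the rain_mm gap swamps the 60-degree temperature gap
dist(scale(X.toy));   # 2 : after scaling, each column contributes equally
```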


2.9.5.1 WHICH X

whichX = X;
# whichX = Xs;

2.9.6 Perform k-means on All Climate Features

With the selected features let’s perform k-means. Let’s select k=12. I will select all of the features, but you can change that if you wish.

2.9.6.1 Descriptives of Sample

colors = rainbow(50, s = 0.6, v = 0.75); # 50 colors for 50 states

# descriptive star plot to start
stars(whichX, len = 0.5, key.loc=c(12,2), draw.segments = TRUE);

## too busy, let's group them
x.start = 1;
x.end = 10;
for(i in 1:5)
  {
  stars( whichX[x.start:x.end,] , 
          len = 0.5, key.loc=c(6,2), draw.segments = TRUE);
  x.start = 1 + x.end;
  x.end = 10 + x.end;
  }

Above, you are just analyzing the general shapes. Which ones are “fuller” circles? Why?

Which ones are not very “full circles”? Why?


2.9.6.2 Computation of Clusters/Centroids

k = 12;
iterations = 100;
number.starts = 100;

whichX.kmeans = kmeans(whichX, k, 
                    iter.max=iterations, 
                    nstart = number.starts);  # default algorithm
stars(whichX.kmeans$centers, len = 0.5, key.loc = c(10, 3),
        main = "Algorithm: DEFAULT [Hartigan-Wong] \n Stars of KMEANS=12", draw.segments = TRUE);


2.9.6.3 Cluster Membership and Centroid Attributes

# keep the state labels attached: $cluster is a vector named by the rownames of whichX
membership = as.data.frame( whichX.kmeans$cluster );
        colnames(membership) = c("Cluster");
membership = membership[ order(membership$Cluster), , drop=FALSE ];  # sort, rownames stay aligned

membership;
print( table(membership) ) ; 
## membership
##  1  2  3  4  5  6  7  8  9 10 11 12 
##  1  1  5  6  4  3  8  4  5  7  1  5
# I believe in an older version of R these were called $centroids
attributes = as.data.frame( whichX.kmeans$centers );
    rownames(attributes) = paste0("Cluster.",1:12);
  attributes;
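One hedged shortcut for listing co-members: kmeans() keeps the rownames of its input as the names of $cluster, so split() groups the labels by cluster id directly. A self-contained sketch on toy named data (the climate frame isn't reproduced here):

```r
set.seed(7);
pts = matrix(stats::rnorm(40), ncol = 2);        # 20 hypothetical "cities"
pts[11:20, ] = pts[11:20, ] + 6;                 # shift half the points far away
rownames(pts) = paste0("city", 1:20);
km = stats::kmeans(pts, centers = 2, nstart = 25);
split(rownames(pts), km$cluster);                # lists which cities share a cluster
```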

2.9.6.4 (10 points) Summarize Findings

  • Identify which states share a common cluster.
  • For a given cluster, what are its primary characteristics?

Alabama is the only state that is a part of cluster 1. 

Alaska is the only state that is a part of cluster 2.

Arizona, Arkansas, California, Colorado, and Connecticut are the five states that are a part of cluster 3.

Delaware, Florida, Georgia, Hawaii, Idaho, and Illinois are the six states that are a part of cluster 4.

Indiana, Iowa, Kansas, and Kentucky are the four states that are a part of cluster 5.

Louisiana, Maine, and Maryland are the three states that are a part of cluster 6.

Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Montana, Nebraska, and Nevada are the eight states that are a part of cluster 7.

New Hampshire, New Jersey, New Mexico, and New York are the four states that are a part of cluster 8.

North Carolina, North Dakota, Ohio, Oklahoma, and Oregon are the five states that are a part of cluster 9. 

Pennsylvania, Rhode Island, South Carolina, South Dakota, Tennessee, Texas, and Utah are the seven states that are a part of cluster 10.

Vermont is the only state that is a part of cluster 11.

Virginia, Washington, West Virginia, Wisconsin, and Wyoming are the five states that are a part of cluster 12.

[ Summarize your k-means findings for 12 clusters.]

2.10 (15 points) Correlation

Correlation, like distance, is an important feature of multivariate analysis. So let’s review some basic correlation related to our climate data. For simplicity, let’s consider “Record High Temperature” and “Record Low Temperature” and see how they correlate with other factors we have gathered from Wikipedia.

Recall, in this table, “Jan-Dec” are different months of the same temperature variable.

library(Hmisc); # p-values for correlation

high = subsetDataFrame(climate, c("key", "units"), "==", c("Record high F (C)",1));
high = merge(high, capitals, by=c("capital","state"));

high.X = high[,c(5:18,21)]; # numeric data
high.cor = round( cor(high.X), digits=2);
# high.cor.p = rcorr(as.matrix(high.X), type="pearson");  # p-values for statistical significance ... # str(high.cor.p);

# examine July (idx = 7)

as.data.frame( high.cor ) ; # so it will render nicely in RStudio
high.cor.july = high.cor[,7];
high.cor.july;
##                 Jan                 Feb                 Mar                 Apr 
##                0.25                0.45                0.53                0.65 
##                 May                 Jun                 Jul                 Aug 
##                0.85                0.90                1.00                0.91 
##                 Sep                 Oct                 Nov                 Dec 
##                0.86                0.73                0.46                0.27 
##            latitude           longitude population.2019.est 
##               -0.15                0.02                0.31
plot(high.cor.july[1:12], ylab="Correlation", xlab="Month number", main="Correlation Trend by Month")

The plot above visually displays the correlation of each month's record high with July's record high. As expected, the cooler months (months further from July) have weaker correlation with July than the warmer months (months closer to July) do. 

The record high for any given month is positively correlated with July's record high (for all months of the year). This suggests that in cities where July's record high is higher, the record highs of the other months tend to be higher as well, and that where July's record high is lower, the other months' record highs tend to be lower too.

January is correlated with July (0.25).  This correlation is very weak because it is relatively close to 0. Overall, this suggests that the likelihood of the record high temperature of January increasing/decreasing as the record high temperature of July increases/decreases is relatively low. In other words, these two variables have a lower likelihood of having a relationship than two variables with a higher correlation. 

February is correlated with July (0.45).  This correlation is weak because it is closer to 0 than it is to 1. Overall, this suggests that the likelihood of the record high temperature of February increasing/decreasing as the record high temperature of July increases/decreases is relatively low. In other words, these two variables have a lower likelihood of having a relationship than two variables with a higher correlation. 

March is correlated with July (0.53).  This correlation is moderately strong because it is close to 0.5. Overall, this suggests that the likelihood of the record high temperature of March increasing/decreasing as the record high temperature of July increases/decreases is moderately high. In other words, these two variables are likely to have a relationship, but it may not be very strong.

April is correlated with July (0.65).  This correlation is moderately strong because it is relatively close to 0.5. Overall, this suggests that the likelihood of the record high temperature of April increasing/decreasing as the record high temperature of July increases/decreases is moderately high. In other words, these two variables are likely to have a relationship, but it may not be very strong.

May is correlated with July (0.85).  This correlation is strong because it is relatively close to 1. Overall, this suggests that the likelihood of the record high temperature of May increasing/decreasing as the record high temperature of July increases/decreases is high. In other words, these two variables are likely to have a strong positive relationship.

June is correlated with July (0.90).  This correlation is strong because it is close to 1. Overall, this suggests that the likelihood of the record high temperature of June increasing/decreasing as the record high temperature of July increases/decreases is very high. In other words, these two variables are very likely to have a strong positive relationship.

As stated below, July perfectly correlates with July (1.00). This correlation is very strong and positive, which can be expected because they are the same data. 

August is correlated with July (0.91).  This correlation is strong because it is close to 1. Overall, this suggests that the likelihood of the record high temperature of August increasing/decreasing as the record high temperature of July increases/decreases is very high. In other words, these two variables are very likely to have a strong positive relationship.

September is correlated with July (0.86).  This correlation is strong because it is relatively close to 1. Overall, this suggests that the likelihood of the record high temperature of September increasing/decreasing as the record high temperature of July increases/decreases is high. In other words, these two variables are likely to have a strong positive relationship.

October is correlated with July (0.73).  This correlation is moderately strong. Overall, this suggests that the likelihood of the record high temperature of October increasing/decreasing as the record high temperature of July increases/decreases is moderately high. In other words, these two variables are likely to have a relationship, but it may not be very strong.

November is correlated with July (0.46).  This correlation is weak because it is closer to 0 than it is to 1. Overall, this suggests that the likelihood of the record high temperature of November increasing/decreasing as the record high temperature of July increases/decreases is relatively low. In other words, these two variables have a lower likelihood of having a relationship than two variables with a higher correlation.

December is correlated with July (0.27).  This correlation is very weak because it is relatively close to 0. Overall, this suggests that the likelihood of the record high temperature of December increasing/decreasing as the record high temperature of July increases/decreases is relatively low. In other words, these two variables have a lower likelihood of having a relationship than two variables with a higher correlation. 

Longitude is correlated with record temperature in July (0.02). This correlation is extremely weak, as its value is almost 0. Based on this number, I would not expect a relationship between longitude and the record high temperature in July.

Latitude is correlated with record temperature in July (-0.15). This correlation is very weak, as its value is relatively close to 0. One thing to note is that this correlation is negative, so if there is a relationship between the two variables, it would be inverse. In other words, as latitude increases, the record temperature in July would tend to decrease, and vice versa. These two variables have a lower likelihood of having a relationship than two variables with a stronger correlation.

Population is correlated with record temperature in July (0.31). This correlation is weak because it is relatively close to 0. Overall, this suggests that the likelihood of the population increasing/decreasing as the record high temperature of July increases/decreases is relatively low. In other words, these two variables have a lower likelihood of having a relationship than two variables with a higher correlation. 

Intuitively, I would think that the months that have more extreme weather would correlate most with latitude for this data. These months would include the winter months of November, December, January, and February. This can be confirmed by looking at the data frame above. Looking at the data frame, it also seems that the summer months (June, July, and August) correlate least with latitude for this data. This correlation always has the same sign, negative.

Describe the correlation of July in “Record high F (C)” to the other numeric factors printed above.

x is correlated with y (0.00). This correlation is positive/negative which means … This correlation is strong/weak because … overall, this suggests …

  • July perfectly correlates with July (1.00). This correlation is positive and very strong. This is because they are the same data.

  • With the months, you can note each, or plot a trend showing them, and discussing them briefly as a trend.

  • latitude is a measure of north/south, so be certain to apply the correlation value with some meaning. be certain you know which direction is positive or negative to correctly interpret the sign of the correlation.

  • longitude is a measure of east/west, so be certain …

  • population is the size of the city

  • intuitively, which months do you think correlate most with latitude for this data? which correlate the least? is the correlation always the same sign (positive/negative), or does it change? [You can use the dataframe output to do this analysis, or create your own subset]
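A self-contained sketch of the kind of check the latitude bullet asks for, on synthetic values (not the Wikipedia data): if winter temperatures drop faster with latitude than summer temperatures do, the winter months should correlate more strongly (and negatively) with latitude.

```r
set.seed(3);
lat = stats::runif(50, min = 25, max = 48);        # hypothetical city latitudes
janF = 60 - 1.8*lat + stats::rnorm(50, sd = 4);    # winter: steep latitude effect
julF = 105 - 0.3*lat + stats::rnorm(50, sd = 4);   # summer: shallow latitude effect
# January's correlation with latitude is far more negative than July's
round(c(Jan = cor(lat, janF), Jul = cor(lat, julF)), 2);
```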

library(Hmisc); # p-values for correlation

low = subsetDataFrame(climate, c("key", "units"), "==", c("Record low F (C)",1));
low = merge(low, capitals, by=c("capital","state"));

low.X = low[,c(5:18,21)]; # numeric data
low.cor = round( cor(low.X), digits=2);
# low.cor.p = rcorr(as.matrix(low.X), type="pearson");  # p-values for statistical significance ... # str(low.cor.p);

# examine Jan (idx = 1)

as.data.frame( low.cor ) ; # so it will render nicely in RStudio
low.cor.january = low.cor[,1];
low.cor.january;
##                 Jan                 Feb                 Mar                 Apr 
##                1.00                0.95                0.93                0.92 
##                 May                 Jun                 Jul                 Aug 
##                0.89                0.80                0.74                0.77 
##                 Sep                 Oct                 Nov                 Dec 
##                0.90                0.87                0.91                0.97 
##            latitude           longitude population.2019.est 
##               -0.72               -0.33                0.34
plot(low.cor.january[1:12], ylab="Correlation", xlab="Month number", main="Correlation Trend by Month")
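The commented-out rcorr() call above is one route to p-values; base R's cor.test() (from the built-in stats package) gives the same kind of significance check for a single pair of variables. A minimal sketch on toy vectors (hypothetical data, not the exam's):

```r
# Toy vectors: is the observed correlation distinguishable from zero?
x = c(10, 12, 15, 19, 24, 30, 33);
y = c(1.1, 1.4, 1.9, 2.1, 2.8, 3.2, 3.3);

ct = cor.test(x, y, method = "pearson");
round( ct$estimate, digits = 2 );   # Pearson correlation coefficient
ct$p.value;                         # small value => statistically significant
```

Unlike rcorr(), which returns a full matrix of pairwise p-values, cor.test() handles one pair at a time but needs no extra package.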

Describe the correlation of January in “Record low F (C)” to the other numeric factors printed above.

The plot above visually displays the correlation of each month's record low with January's record low. As expected, the cooler months (those closer to January) generally correlate more strongly with January than the warmer months (those further from January) do. It is important to note that the y-axis of this plot only goes down to 0.75. The lowest correlation among these months is 0.74, which is still relatively strong. Because of this, it is fair to say that the record low temperatures of all months have a fairly strong positive relationship with the record low temperature in January.

The record low for any given month is positively correlated with the record low for January. This suggests that if the record low for January were to increase, the record low for every other month would also tend to increase, and vice versa.

January is perfectly correlated with January (1.00). This correlation is very strong and positive, which can be expected because they are the same data.

February is correlated with January (0.95), March is correlated with January (0.93), April is correlated with January (0.92), May is correlated with January (0.89), September is correlated with January (0.90), October is correlated with January (0.87), November is correlated with January (0.91), and December is correlated with January (0.97). These correlations are all very strong because they are close to 1. Overall, this suggests that the likelihood of the record low temperatures of February, March, April, May, September, October, November, and December increasing/decreasing as the record low temperature of January increases/decreases is very high. In other words, these variables are very likely to have a strong positive relationship with the record low temperature of January.

The more moderate correlations (although they are still relatively strong) with respect to the record low in January are June (0.80), July (0.74), and August (0.77). Overall, this implies that there is a relatively high likelihood of the record low temperatures of these months following the trend of the record low temperature in January, but the likelihood is not as high as it is for other months (with higher correlation).

Longitude is correlated with the record low temperature in January (-0.33). This correlation is relatively weak, as its value is closer to 0 than to -1. Based on this number, I would expect only a weak inverse relationship between longitude and the record low temperature in January.

Latitude is correlated with the record low temperature in January (-0.72). This correlation is moderately strong, as its value is relatively close to -1. Because this correlation is negative, the relationship between the two variables is inverse: as latitude increases, the record low temperature in January decreases, and vice versa. These two variables have a relatively high likelihood of having a relationship.

Population is correlated with the record low temperature in January (0.34). This correlation is weak because it is relatively close to 0. Overall, this suggests that the likelihood of the population increasing/decreasing as the record low temperature of January increases/decreases is relatively low. In other words, these two variables have a lower likelihood of having a relationship than two variables with a higher correlation. A positive correlation does make sense here, though, because most people would rather live somewhere with a milder winter.

Similar to “high” writeup, but for the “low” data.

2.11 “So What” is DATA ANALYSIS?

In the social sciences (e.g., Karl Weick), the concept of “sense making” refers to “the process by which people give meaning to their collective experiences”. I have used this framework in my high-technology innovation research (See Figure 1 of http://www.mshaffer.com/arizona/pdf/LoneGenius.pdf, my rubric concept comes from learning-theory growth models: Nascent, Adolescent, Mature.)

This final topic is reflective: we are thinking about how we think.

2.11.1 Statistics

The syllabus defined statistics as “the discipline that concerns the collection, organization, analysis, interpretation and presentation of data.” (See https://en.wikipedia.org/wiki/Statistics)

There are 5 elements mentioned: collection, organization, analysis, interpretation, and presentation of “data”. Are those equally weighted? That is, should we devote 20% of our time to each of those? Now, consider the “analysis” stage. I have suggested there are two camps: exploratory and confirmatory data analysis. Are those equally weighted? That is, should we devote 50% of our time to each of those? In an “equally-likely” scenario, we would have:

x = c(20,20,10,10,20,20);
x.labels = c("collection", "organization", "EDA analysis", "CDA analysis", "interpretation", "presentation");
x.colors = c("blue", "lightgreen", "green", "darkgreen", "orange", "red");

barplot(x, 
        col = x.colors,
        ylim = c(0, 20),
        ylab = "Proportion: (Sums to 100)",
        main="Statistics as the Study of 'Data'");


text(1.14* (1:6), par("usr")[3], col = x.colors, labels = x.labels, srt = 45, adj = c(1.1,1.1), xpd = TRUE, cex=.75);

2.11.2 Data Analytics

Source: https://data-analytics.wsu.edu/197-2/ (Accessed October 2010)

“Data analytics is the application of powerful new methods—drawn from computer science, mathematics and statistics, and domain sciences—to collect, curate, analyze, discover and communicate knowledge from ‘big data’.” https://data-analytics.wsu.edu/ (Accessed October 2010)

2.11.2.1 Importance of ‘Data’

I love data.

I also love math/physics. I also love exploratory data analysis. I also love computational statistics, or statistical computing (https://en.wikipedia.org/wiki/Computational_statistics). I also love thinking about developing the one graphic that summarizes data most effectively.

2.11.2.2 Apprenticeship as Learning a Trade

The idea of sharing in the learning process is an important aspect of the apprenticeship model. You are learning a trade (data analytics). I have experience in this trade. My job as the instructor is to provide you with a variety of “situated-learning” experiences to help you understand the nature of the trade. This exam is an example of such an experience.

2.11.2.3 Tools of the Trade

Below are the core requirements for the data analytics program:

  • Calculus and linear algebra (10 credits)
  • Computer science fundamentals (11 credits)
  • Machine learning and data management (9 credits)
  • Statistics (15 credits)
  • Data analytics introduction, ethics & project-focused capstone experience (9 credits)

These are not the tools of the trade, but hopefully, they introduce you to key tools of the trade. What exactly are tools of the trade? [You will have an opportunity to write a response below.]

2.11.2.4 Dimensional Reduction, an Axiomatic View

A video recently shared with me highlights some distinctions among persons practicing various forms of data analysis: https://www.youtube.com/watch?v=uHGlCi9jOWY. As an orthogonal projection, I would create two axes. On the horizontal axis (x-axis), I would place “theory of data” to the left and “application of data” to the right. On the vertical axis (y-axis), I would place “care for data integrity” at the top and “less care for data integrity” at the bottom.

2.11.2.5 Skills of the Trade

As someone who comes from industry and has hired young people like you out of Computer Science and Electrical-Computer Engineering, I have opinions related to the skills of the trade.

  • Can you acquire an appreciation for “data intimacy”?
  • Can you track and document how data is curated?
  • Can you track and document the analyses you perform? Can you recreate them? Do you have basic version-control protocols in place?
  • Can you view data from multiple perspectives and synthesize those perspectives to identify the central theme of the data? Can you be objective? Can you try to identify objective metrics to enlighten your understanding of the essence of the data?
  • Can you experiment with different visualizations in search of an optimal “one graphic” result? Do you have practice using various visualization tools? Can you comprehend which visualization tool is appropriate for messaging (communicating results) to a particular audience?
  • Can you communicate and defend your findings to a particular audience? Are your communications professional? Is the final work product both simple and comprehensive: simple in its summary findings and comprehensive in its ability to be replicated and audited as necessary?

2.12 (20 points) YOUR OPINION OF DATA ANALYTICS

[This is worth 20 points.

Specifically, address:

  1. what proportion of “statistics” should be divided among: collection, organization, analysis, interpretation, and presentation of “data” … providing a barplot of your opinion within your response would seem appropriate
x = c(10,10,20,15,25,20);  ## change these values and discuss ...
x.labels = c("collection", "organization", "EDA analysis", "CDA analysis", "interpretation", "presentation");
x.colors = c("blue", "lightgreen", "green", "darkgreen", "orange", "red");

barplot(x, 
        col = x.colors,
        ylim = c(0, max(x)),
        ylab = "Proportion: (Sums to 100)",
        main="Statistics as the Study of 'Data'");


text(1.14* (1:6), par("usr")[3], col = x.colors, labels = x.labels, srt = 45, adj = c(1.1,1.1), xpd = TRUE, cex=.75);

  2. what tools of the trade should you be acquiring from the core courses? how are you doing in that acquisition process (e.g., tool X is … and right now I feel like my understanding/proficiency of tool X … ) …

  3. utilize the provided plot script to place the core-course categories on the proposed x-y graph related to analytics practice (Applied vs Theoretical) and care of data integrity (Great Care vs Little Care) … also place your personal assessment on the plot provided

  4. evaluate your skill-level on the six “skills of the trade”: Emerging (Nascent), Developing (Adolescent), Mastering (Mature). explain your evaluation and include other important skills you believe are relevant that are not included

  5. Any other comments you would like to share.

]

(1) I think that the largest proportion of "statistics" (25%) should be devoted to interpretation of the data because, without proper interpretation, the field of statistics loses its purpose. Along those same lines, I think that presentation of the data should be 20% of "statistics". The way that data, and the conclusions drawn from it, are presented is important, because if others are not able to understand why the conclusions were drawn, or even what conclusions were drawn, those conclusions become almost useless. I think that a very important part of a data analyst's job is making sure they can effectively communicate their results. Also deserving 20% is EDA analysis. This branch of statistics is very important for enabling new predictions and discoveries about all aspects of life. Next, I placed CDA analysis at 15%. I think that it is important to consistently challenge assumptions made about data and make sure that the conclusions drawn are still valid. Coming in at 10% each are collection and organization of data. Although these are important steps in the process, they yield data sets that can be reused across multiple different analyses.

(2) The tools of the trade that I think should be acquired from the core courses include:
    - the ability to think critically and solve problems 
    - the ability to apply what is learned in the degree program to the "real" world
    - how to use programming languages to help analyze data.
Right now, I think that my understanding / proficiency of all three of these "tools of the trade" would be ranked at an intermediate level. I think that this is reasonable, considering I am taking a wide variety of courses that introduce me to and help develop skills in many different aspects of data analytics. I don't imagine I will consider myself "proficient" or a master in any of these things until I am in a certain niche of the job market where I will have the opportunity to further develop my skills. 
   
(3) see plot below...

(4) Of the six "skills of the trade", I would rate my skill-level as emerging for each of them. Although I am getting my degree in Data Analytics, it doesn't seem like I have had much experience working with data thus far. Many of the classes that I have taken seem to have fallen more on the "theoretical" side of the spectrum. I think that this course has helped me (or I'm anticipating it will in the future based on where the course is going) to develop some, if not all, of these skills, in particular tracking and documenting the analyses I perform, experimenting with different visualizations in search of an optimal "one graphic" result, and communicating and defending my results. 

(5) Stat 419 has been the most challenging class that I have encountered in my degree program. While this can be frustrating at times, I appreciate that it is not only preparing me to be a better candidate for the job market, but it also allows me to exercise my problem solving skills, which will transfer over into other courses and projects in the future. 
# x is -1 for perfectly theoretical
# x is 1 for perfectly applied

# y is -1 for no care whatsoever for data integrity
# y is 1 for perfect care for data integrity

########################### basic plot setup #####
plot(0,0, col="white", 
  ylim=c(-1.5,1.5), xlim=c(-1.5,1.5),
  xlab = "",
  ylab = "",
  xaxt = 'n', bty = 'n', yaxt = 'n',
  main = "Axiomatic Perspective on Practice/Care",
  );
segments(-1,0,1,0, col="#999999");
segments(0,-1,0,1, col="#999999");
text(-1.1,0, "Theoretical Data Practice", cex=0.5, srt = 90);
text(1.1,0, "Applied Data Practice", cex=0.5, srt = -90);
text(0,1.1, "Great Care for Data Integrity", cex=0.5, srt = 0);
text(0,-1.1, "Little Care for Data Integrity", cex=0.5, srt = 0);
########################### basic plot setup #####


########################### you can add elements here #####
## this point represents the professor's self-perception
# points(0.75, 0.95, pch=20, col="blue");
# text(0.75, 0.95, "Shaffer (self)", col="blue", cex=0.75, srt = 45, pos=2);

#############  TODO ###### ... maybe change color for each data point

# https://brand.wsu.edu/visual/colors/
# crimson = #981e32
# you need to change the x,y from 0,0 ...
# you can change col ... cex (font size), srt (angle), and pos = 1,2,3,4
# 
points(0.25, 0.7, pch=20, col="#981e32");
text(0.25, 0.7, "Jailee (self)", col="#981e32", cex=0.75, srt = 45, pos=3);
# 
# ## evaluate the Course Categories of Tools of the Trade
# ## give them a score
# 
points(-0.9, 0, pch=20, col="#ed3edc");
text(-0.9, 0, "Math(s)", col="#ed3edc", cex=0.5, srt = 45, pos=3);
# 
# 
points(0.3, 0.2, pch=20, col="#613af0");
text(0.3, 0.2, "Computer Science", col="#613af0", cex=0.5, srt = 45, pos=3);
# 
points(0.6, 0.7, pch=20, col="#3abff0");
text(0.6, 0.7, "Machine learning", col="#3abff0", cex=0.5, srt = 45, pos=3);
# 
points(-0.8, 0.7, pch=20, col="#0078cf");
text(-0.8, 0.7, "Statistics", col="#0078cf", cex=0.5, srt = 45, pos=3);
# 
points(0.7, 0.8, pch=20, col="#1ac489");
text(0.7, 0.8, "Data analytics", col="#1ac489", cex=0.5, srt = 45, pos=1);
# 
# # You have a track (e.g., Business)
points(-0.4, 0.3, pch=20, col="#f09116");
text(-0.4, 0.3, "Core discipline - Actuarial Science", col="#f09116", cex=0.5, srt = 45, pos=3);